MULTI-MICROPHONE ADAPTIVE NOISE REDUCTION TECHNIQUES
FOR SPEECH ENHANCEMENT



I. Background

In speech communication applications, such as teleconferencing, hands-free telephony and hearing aids,
the presence of background noise and/or reverberation may significantly reduce the intelligibility of the
desired speech signal. This stems from the large distance between the speaker and the microphone(s). Hence,
the use of a noise reduction algorithm is necessary. Multi-microphone systems exploit spatial information
in addition to temporal and spectral information of the desired signal and noise signal and are thus
preferred to single-microphone procedures (such as spectral subtraction). For aesthetic reasons,
multi-microphone techniques for, e.g., hearing aid applications go together with the use of small-sized
arrays. Considerable noise reduction can be achieved with such arrays, but at the expense of an increased
sensitivity to errors in the assumed signal model, such as microphone mismatch and reverberation [1, 2]. In
hearing aids, microphones are rarely matched in gain and phase. In [3], e.g., gain and phase differences
between microphone characteristics of up to 6 dB and 10°, respectively, have been reported.

A widely studied multi-channel adaptive noise reduction algorithm is the Generalized Sidelobe Canceller
(GSC) [2]-[11], depicted in Figure 1. The GSC consists of a fixed, spatial pre-processor, which
includes a fixed beamformer and a blocking matrix, and an adaptive stage based on an Adaptive Noise
Canceller (ANC) [12]. The ANC minimizes the output noise power while the blocking matrix should avoid
speech leakage into the noise references. The standard GSC assumes the desired speaker location, the
microphone characteristics and positions to be known, and reflections of the speech signal to be absent. If these
assumptions are fulfilled, it provides an undistorted enhanced speech signal with minimum residual noise.
However, in reality these assumptions are often violated, resulting in so-called speech leakage and hence
speech distortion. To limit speech distortion, the ANC is adapted during periods of noise only [7, 10, 13].
When used in combination with small-sized arrays, e.g., in hearing aid applications, an additional robustness
constraint [9, 10, 14, 15] is required to guarantee performance in the presence of small errors in the assumed
signal model, such as microphone mismatch [16, 17]. A widely applied method consists of imposing a
Quadratic Inequality Constraint on the ANC (QIC-GSC) [10, 14, 15, 18, 19]. For LMS updating, the Scaled
Projection Algorithm (SPA) [14] is a simple and effective technique that imposes this constraint. However,
the QIC-GSC comes at the expense of less noise reduction [17].

In [20], a Multi-channel Wiener Filtering (MWF) technique has been proposed that provides a Minimum
Mean Square Error (MMSE) estimate of the desired signal portion in one of the received microphone signals
[21]-[24]. In contrast to the ANC of the GSC, the MWF is able to take speech distortion into account in
its optimization criterion. The MMSE optimization criterion of the MWF can also be generalized to allow
for a trade-off between speech distortion and noise reduction. We will refer to this generalization as Speech
Distortion Weighted MWF (SDW-MWF). The MWF technique is based solely on estimates of the second
order statistics of the recorded speech signal and the noise signal. A robust speech detection is thus (again)
needed. In contrast to the GSC, the MWF does not make any a priori assumptions about the signal model, so
that no or a less severe robustness constraint is needed to guarantee performance when used in combination
with small-sized arrays [16, 17]. Especially in complicated noise scenarios such as multiple noise sources
or diffuse noise, the MWF outperforms the GSC, even when the GSC is supplemented with a robustness
constraint [17].

In [20, 21], the implementation of the MWF is based on a Generalized Singular Value Decomposition 
(GSVD) of an input data matrix and a noise data matrix. A cheaper alternative based on a QR Decompo- 
sition (QRD) has been proposed in [22]. A subband implementation [23] results in improved intelligibility 
at a significantly lower cost compared to the fullband approach. However, in contrast to the GSC and the 
QIC-GSC [14], no efficient, cheap stochastic gradient based implementation of the (SDW-)MWF, which 
avoids the use of expensive matrix computations, is available yet. In [25], an LMS based algorithm for the 
MWF has been developed. The algorithm needs recordings of calibration signals. Since room acoustics, 
microphone characteristics and the location of the desired speaker change over time, frequent re-calibration 
is required, making this approach cumbersome and expensive. In [26], an LMS based SDW-MWF has
been proposed that avoids the need for calibration signals. The algorithm, however, relies on some
independence assumptions that are not necessarily satisfied, resulting in degraded performance compared to
matrix-based implementations.

II. Summary

In the present invention, we establish a generalized multi-channel noise reduction scheme, referred to 
as Spatially Pre-processed Speech Distortion Weighted Multi-channel Wiener Filter (SP-SDW-MWF), that 
encompasses the GSC and the MWF as extreme cases. In addition, the scheme allows for in-between 
solutions such as the Speech Distortion Regularized GSC (SDR-GSC). The generalized scheme, depicted 
in Figure 3, consists of a fixed, spatial pre-processor and an adaptive stage that is based on an SDW-MWF, 
hence the name Spatially Pre-processed Speech Distortion Weighted Multi-channel Wiener filter (SP-SDW- 
MWF). 

The SP-SDW-MWF adds robustness against signal model errors to the GSC by taking speech distortion
explicitly into account in the design criterion of the adaptive stage. The SP-SDW-MWF is an alternative
technique to the widely studied QIC-GSC to decrease the sensitivity of the GSC to signal model errors
such as microphone mismatch and reverberation. A parameter $\mu$ is incorporated in the SP-SDW-MWF that
allows for a trade-off between speech distortion and noise reduction. Focussing all attention towards speech
distortion (i.e., setting $\mu = 0$) results in the output of the fixed beamformer. In noise scenarios with very
low Signal-to-Noise Ratio (SNR), e.g., -10 dB, a fixed beamformer may be preferred. Adaptivity can then
be easily reduced or excluded in the SP-SDW-MWF by decreasing the parameter $\mu$ to 0. Compared to the
widely studied QIC-GSC, the SP-SDW-MWF achieves a better noise reduction performance for a given
maximum allowable speech distortion level.

In [22, 27], recursive implementations of the (SDW-)MWF have been proposed based on a GSVD or QR
decomposition. A subband implementation [28] results in improved intelligibility at a significantly lower
cost compared to the fullband approach. These techniques can be extended to implement the SP-SDW-MWF
[29]. However, in contrast to the GSC and the QIC-GSC [14], no cheap stochastic gradient based
implementation of the SP-SDW-MWF is available. In the present invention, we propose time-domain and
frequency-domain stochastic gradient implementations of the SP-SDW-MWF that preserve the benefit of
the matrix-based SP-SDW-MWF over QIC-GSC.

Below, the different embodiments of the present invention are described. 

A first embodiment proposes a Speech Distortion Regularized GSC (SDR-GSC). A new design criterion
is developed for the adaptive stage of the GSC: the ANC design criterion is supplemented with a
regularization term that limits speech distortion due to signal model errors. In the SDR-GSC, a parameter $\mu$ is
incorporated that allows for a trade-off between speech distortion and noise reduction. Focussing all
attention on noise reduction results in the standard GSC, while, on the other hand, focussing all attention towards
speech distortion results in the output of the fixed beamformer. In noise scenarios with low SNR, adaptivity
in the SDR-GSC can be easily reduced or excluded by increasing attention towards speech distortion, i.e., by
decreasing the parameter $\mu$ to 0. The SDR-GSC is an alternative technique to the QIC-GSC to decrease the
sensitivity of the GSC to signal model errors such as microphone mismatch and reverberation. In contrast to
the QIC-GSC, the SDR-GSC shifts emphasis towards speech distortion when the amount of speech leakage
grows. In the absence of signal model errors, the performance of the GSC is preserved. As a result, a better
noise reduction performance is obtained for small model errors, while guaranteeing robustness against large
model errors.

In a second embodiment, we further improve the noise reduction performance of the SDR-GSC by
adding an extra adaptive filtering operation $\mathbf{w}_0$ on the speech reference signal. We refer to this
generalized scheme as Spatially Pre-processed Speech Distortion Weighted Multi-channel Wiener Filter
(SP-SDW-MWF). The SP-SDW-MWF is depicted in Figure 3 and encompasses the MWF [20] as a special case. Again,
a parameter $\mu$ is incorporated in the design criterion to allow for a trade-off between speech distortion and
noise reduction. Focussing all attention on speech distortion results in the output of the fixed beamformer.
Also here, adaptivity can be easily reduced or excluded by decreasing $\mu$ to 0. It is shown that, in the
absence of speech leakage and for infinitely long filter lengths, the SP-SDW-MWF corresponds to a cascade
of an SDR-GSC with an SDW single-channel Wiener postfilter (SDW-SWF) [30] and thus outperforms the
SDR-GSC. In the presence of speech leakage, the SP-SDW-MWF with $\mathbf{w}_0$ tries to preserve its performance:
compared to an SDR-GSC (with SDW-SWF postfilter), the SP-SDW-MWF then contains extra filtering
operations that compensate for the performance degradation of the SDR-GSC (with SDW-SWF) due to speech
leakage (see also Figure 4). In contrast to the SDR-GSC (and thus also the GSC), performance does not
degrade due to microphone mismatch. In [22, 27], recursive implementations of the (SDW-)MWF have been
proposed based on a GSVD or QR decomposition. A subband implementation [28] results in improved
intelligibility at a significantly lower cost compared to the fullband approach. These techniques can be
extended to implement the SDR-GSC and, more generally, the SP-SDW-MWF.

In a third embodiment, we propose cheap time-domain and frequency-domain stochastic gradient
implementations of the SDR-GSC and SP-SDW-MWF. Starting from the design criterion of the SDR-GSC,
or more generally, the SP-SDW-MWF, we derive a time-domain stochastic gradient algorithm. In addition,
we modify the LMS based algorithm [26] so that it applies to the SP-SDW-MWF. To increase convergence
speed and reduce complexity, a frequency-domain implementation has been proposed. Both the stochastic
gradient and the LMS based algorithm suffer from a large excess error when applied in highly time-varying noise
scenarios. We show that the excess error in the stochastic gradient algorithm is reduced by applying a low
pass filter to the part of the gradient estimate that limits speech distortion. The low pass filtering avoids
a highly time-varying distortion of the desired speech component while not degrading the tracking
performance needed in time-varying noise scenarios. The stochastic gradient SP-SDW-MWF outperforms the
LMS based algorithm, while complexity is not increased. Experimental results show that the low pass
filtering significantly improves the performance of the stochastic gradient algorithm and does not compromise the
tracking of changes in the noise scenario. In addition, experiments demonstrate that the proposed stochastic
gradient algorithm preserves the benefit of the SP-SDW-MWF over QIC-GSC. The limited computational
cost and the better noise reduction performance of the proposed algorithm make it a good alternative to the
SPA [14] for implementation in hearing aids.



Brief Description of the Drawings 

A number of embodiments of the present invention, together with some aspects of 
the prior art will now be described with reference to the drawings, in which: 
Fig. 1 depicts the concept of a Generalized Sidelobe Canceller; 
Fig. 2 depicts an equivalent approach of multi-channel Wiener filtering; 
Fig. 3 depicts a Spatially Pre-processed SDW MWF; 

Fig. 4 depicts the decomposition of the SP-SDW-MWF with $\mathbf{w}_0$ into a multi-channel filter
$\mathbf{w}_d$ and a single-channel postfilter $\mathbf{e}_1 - \mathbf{w}_0$;

Fig. 5 shows the influence of $1/\mu$ on the performance of the SDR-GSC for different
gain mismatches $\Upsilon_2$ at the second microphone;

Fig. 6 shows the influence of $1/\mu$ on the performance of the SP-SDW-MWF with $\mathbf{w}_0$
for different gain mismatches $\Upsilon_2$ at the second microphone;

Fig. 7 shows the $\Delta\mathrm{SNR}_{\mathrm{intellig}}$ and $\mathrm{SD}_{\mathrm{intellig}}$ for QIC-GSC as a function of $\beta^2$ for
different gain mismatches $\Upsilon_2$ at the second microphone;

Fig. 8 depicts the complexity of the TD and FD Stochastic Gradient (SG) algorithms with
LP filtering as a function of filter length $L$ per channel; $M = 3$ (for comparison, the
complexity of the standard NLMS ANC and SPA are depicted too);

Fig. 9 depicts the performance of different FD Stochastic Gradient (FD-SG)
algorithms; (a) Stationary speech-like noise at 90°; (b) Multi-talker babble noise at 90°;

Fig. 10 depicts the influence of the LP filter on the performance of the FD stochastic gradient
SP-SDW-MWF ($1/\mu = 0.5$) without $\mathbf{w}_0$ and with $\mathbf{w}_0$. Babble noise at 90°;

Fig. 11 depicts the convergence behavior of FD-SG for $\lambda = 0$ and $\lambda = 0.9998$. The
noise source position suddenly changes from 90° to 180° and vice versa;

Fig. 12 depicts the performance of the FD stochastic gradient implementation of the
SP-SDW-MWF with LP ($\lambda = 0.9998$) in a multiple noise source scenario; and

Fig. 13 depicts the performance of FD SPA in a multiple noise source scenario. 

Detailed Description 

Before the invention is described in detail, the prior art GSC [4] and the QIC-GSC 
[14, 19] will be reviewed under section 1. Under section 2, the Multi-channel Wiener 
Filter (MWF) technique will be discussed [20]. 



1 Generalized Sidelobe Canceller (GSC) 



1.1 Concept 

Figure 1 describes the concept of the Generalized Sidelobe Canceller (GSC) [4], which consists of a fixed,
spatial pre-processor, i.e., a fixed beamformer $\mathbf{A}(z)$ and a blocking matrix $\mathbf{B}(z)$, and an ANC. Given $M$
microphone signals

$$u_i[k] = u_i^s[k] + u_i^n[k], \quad i = 1, \ldots, M \tag{1}$$

with $u_i^s[k]$ the desired speech contribution and $u_i^n[k]$ the noise contribution, the fixed beamformer $\mathbf{A}(z)$
(e.g., delay-and-sum) creates a so-called speech reference

$$y_0[k] = y_0^s[k] + y_0^n[k] \tag{2}$$

by steering a beam towards the direction of the desired signal, with a speech contribution $y_0^s[k]$ and a noise
contribution $y_0^n[k]$. In the sequel an endfire array is assumed and the desired speaker is assumed to be in
front at 0°. The blocking matrix $\mathbf{B}(z)$ creates $M-1$ so-called noise references

$$y_i[k] = y_i^s[k] + y_i^n[k], \quad i = 1, \ldots, M-1 \tag{3}$$

by steering zeroes towards the front so that the noise contributions $y_i^n[k]$ are dominant compared to the
speech leakage contributions $y_i^s[k]$. In the sequel, the superscripts $s$ and $n$ are used to refer to the speech
and noise contribution of a signal. During periods of speech + noise, the references $y_i[k]$, $i = 0, \ldots, M-1$
contain speech + noise. During periods of noise only, the references $y_i[k]$, $i = 0, \ldots, M-1$ only consist of a noise
component, i.e., $y_i[k] = y_i^n[k]$. The second order statistics of the noise signal are assumed to be quite
stationary such that they can be estimated during periods of noise only.
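The fixed spatial pre-processor described above can be illustrated with a minimal sketch. It assumes that the microphone signals have already been time-aligned for the 0° direction, uses a plain average as delay-and-sum beamformer, and forms the noise references by subtracting adjacent channels. The pairwise-difference blocking matrix, the function name `spatial_preprocessor` and the 1.4 gain mismatch are illustrative assumptions, not a prescribed design.

```python
import numpy as np

def spatial_preprocessor(u_aligned):
    """Fixed spatial pre-processor for signals already time-aligned for 0 degrees.

    u_aligned: array of shape (M, N), one row per microphone.
    Returns the speech reference y0 (shape (N,)) and the M-1 noise
    references (shape (M-1, N)).
    """
    u_aligned = np.asarray(u_aligned, dtype=float)
    # Delay-and-sum beamformer: with aligned inputs this is a plain average.
    y0 = u_aligned.mean(axis=0)
    # Blocking matrix (assumed here): differences of adjacent channels,
    # which cancel any component that is identical on all microphones.
    y_noise = u_aligned[:-1] - u_aligned[1:]
    return y0, y_noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(1000)

# Ideal signal model: identical speech on all M = 3 microphones, nothing leaks.
y0_ideal, leak_ideal = spatial_preprocessor(np.tile(speech, (3, 1)))

# Gain mismatch at the second microphone: speech leaks into the noise references.
gains = np.array([1.0, 1.4, 1.0])[:, None]
_, leak_mismatch = spatial_preprocessor(gains * speech)

print(np.max(np.abs(leak_ideal)), np.max(np.abs(leak_mismatch)))
```

Under the ideal model the noise references contain no speech at all, whereas the mismatched case already shows the speech leakage that the rest of this section is concerned with.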

To design the fixed, spatial pre-processor, assumptions are made about the microphone characteristics,
the speaker position and the microphone positions, and furthermore reverberation is assumed to be absent.
If these assumptions are satisfied, the noise references do not contain any speech, i.e., $y_i^s[k] = 0$ for
$i = 1, \ldots, M-1$. However, in practice, the assumptions are often violated (e.g., due to microphone
mismatch and reverberation) so that speech leaks into the noise references. To limit the effect of such signal
leakage, the ANC $\mathbf{w}_{1:M-1}$¹, where

$$\mathbf{w}_{1:M-1}^H = [\,\mathbf{w}_1^H \;\; \mathbf{w}_2^H \;\; \cdots \;\; \mathbf{w}_{M-1}^H\,] \tag{4}$$

$$\mathbf{w}_i = [\,w_i[0] \;\; w_i[1] \;\; \cdots \;\; w_i[L-1]\,]^T, \tag{5}$$

is adapted during periods of noise only [7, 13]. Hence, the ANC $\mathbf{w}_{1:M-1}$ minimizes the output noise power,
i.e.,

$$\mathbf{w}_{1:M-1} = \arg\min_{\mathbf{w}_{1:M-1}} \mathcal{E}\{\,|y_0^n[k-\Delta] - \mathbf{w}_{1:M-1}^H \mathbf{y}_{1:M-1}^n[k]|^2\,\} \tag{6}$$

and equals

$$\mathbf{w}_{1:M-1} = \mathcal{E}\{\mathbf{y}_{1:M-1}^n[k]\,\mathbf{y}_{1:M-1}^{n,H}[k]\}^{-1}\, \mathcal{E}\{\mathbf{y}_{1:M-1}^n[k]\, y_0^{n,*}[k-\Delta]\}, \tag{7}$$

where

$$\mathbf{y}_{1:M-1}^{n,H}[k] = [\,\mathbf{y}_1^{n,H}[k] \;\; \mathbf{y}_2^{n,H}[k] \;\; \cdots \;\; \mathbf{y}_{M-1}^{n,H}[k]\,] \tag{8}$$

$$\mathbf{y}_i^n[k] = [\,y_i^n[k] \;\; y_i^n[k-1] \;\; \cdots \;\; y_i^n[k-L+1]\,]^T \tag{9}$$

and where $\Delta$ is a delay applied to the speech reference to allow for non-causal taps in the filter $\mathbf{w}_{1:M-1}$. The
delay $\Delta$ is usually set to $\lceil L/2 \rceil$, where $\lceil x \rceil$ returns the smallest integer equal to or larger than $x$. The subscript
$1:M-1$ in $\mathbf{w}_{1:M-1}$ and $\mathbf{y}_{1:M-1}$ refers to the subscripts of the first and last channel component of the
adaptive filter and input vector, respectively.
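As a rough numerical illustration of the closed-form ANC solution (7), the sketch below replaces the expectations by sample averages over a noise-only segment. The helper names `tap_matrix`, `delayed` and `anc_wiener`, the single noise reference and the three-tap noise path `h` are assumptions made only for this example.

```python
import numpy as np

def tap_matrix(y, L):
    """Return an (N, C*L) matrix whose row k is the stacked vector
    [y_1[k], ..., y_1[k-L+1], ..., y_C[k], ..., y_C[k-L+1]].
    Samples before time 0 are taken as zero."""
    y = np.atleast_2d(np.asarray(y))
    C, N = y.shape
    Y = np.zeros((N, C * L), dtype=y.dtype)
    for c in range(C):
        for d in range(L):
            Y[d:, c * L + d] = y[c, :N - d]
    return Y

def delayed(x, delay):
    """Return x[k - delay], with zeros for k < delay."""
    x = np.asarray(x)
    out = np.zeros_like(x)
    out[delay:] = x[:len(x) - delay]
    return out

def anc_wiener(y0_noise, yref_noise, L, delay):
    """Sample-average version of (7), computed from noise-only data."""
    Y = tap_matrix(yref_noise, L)
    d = delayed(y0_noise, delay)
    N = len(d)
    R = Y.T @ Y.conj() / N      # estimate of E{y y^H}
    p = Y.T @ d.conj() / N      # estimate of E{y y0^*[k - delay]}
    return np.linalg.solve(R, p)

rng = np.random.default_rng(1)
x = rng.standard_normal(4000)               # one noise reference (M = 2)
h = np.array([0.5, -0.3, 0.2])              # hypothetical noise path
y0_noise = np.convolve(x, h)[:len(x)]       # noise in the speech reference

L = 8
delay = int(np.ceil(L / 2))                 # delay set to ceil(L/2)
w = anc_wiener(y0_noise, x, L, delay)

# Residual noise z[k] = y0[k - delay] - w^H y[k]
residual = delayed(y0_noise, delay) - tap_matrix(x, L) @ w.conj()
```

In this noise-only toy case the filter reproduces the noise path shifted by the delay, so the residual vanishes; the delay leaves room for taps on both sides of the main tap.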

Under ideal conditions ($y_i^s[k] = 0$, $i = 1, \ldots, M-1$), the GSC minimizes the residual noise while
not distorting the desired speech signal, i.e., $z^s[k] = y_0^s[k-\Delta]$. However, when used in combination with
small-sized arrays, a small error in the assumed signal model (hence $y_i^s[k] \neq 0$, $i = 1, \ldots, M-1$) already
suffices to produce a significantly distorted output speech signal $z^s[k]$,

$$z^s[k] = y_0^s[k-\Delta] - \mathbf{w}_{1:M-1}^H \mathbf{y}_{1:M-1}^s[k], \tag{10}$$

even when only adapting during noise-only periods, so a robustness constraint on $\mathbf{w}_{1:M-1}$ is required [17].
In addition, the fixed beamformer $\mathbf{A}(z)$ should be designed so that the distortion in the speech reference
$y_0^s[k]$ is minimal for all possible model errors. In the sequel, a delay-and-sum beamformer is used. For
small-sized arrays, this beamformer offers sufficient robustness against signal model errors, as it minimizes
the white noise gain or noise sensitivity². Given statistical knowledge about the signal model errors that
occur in practice, further optimized beamformers can be designed, e.g., using the techniques in [31].



¹ In a time-domain implementation, the input signals of the adaptive filter $\mathbf{w}_{1:M-1}$ and the filter $\mathbf{w}_{1:M-1}$ are real. Hence,
$\mathbf{w}_{1:M-1}^H = \mathbf{w}_{1:M-1}^T$. In the sequel, the formulas are generalized to complex input signals so that they can also be applied to a
subband implementation.

² The white noise gain or noise sensitivity is defined as the ratio of the spatially white noise gain to the gain of the desired signal
and is often used to quantify the sensitivity of an algorithm against errors in the assumed signal model [2, 14].
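The noise sensitivity defined in the footnote can be checked numerically. The sketch below assumes narrowband beamformer weights `a` and a unit-modulus steering vector `d` for the desired direction; both names, the phase step of 0.4 rad and the function `white_noise_gain` are hypothetical. It compares delay-and-sum weights with randomly drawn weights that have the same response towards the desired signal.

```python
import numpy as np

def white_noise_gain(a, d):
    """Noise sensitivity of beamformer weights a for steering vector d:
    (gain for spatially white noise) / (gain for the desired signal)."""
    a = np.asarray(a, dtype=complex)
    d = np.asarray(d, dtype=complex)
    return float(np.vdot(a, a).real / abs(np.vdot(a, d)) ** 2)

M = 3
# Hypothetical steering vector for the desired direction (unit-modulus phases).
d = np.exp(-1j * 0.4 * np.arange(M))
a_das = d / M                                # delay-and-sum weights
wng_das = white_noise_gain(a_das, d)

# Other weight vectors with the same response a^H d = 1 towards the speaker.
rng = np.random.default_rng(2)
wng_other = []
for _ in range(100):
    a = rng.standard_normal(M) + 1j * rng.standard_normal(M)
    a = a / np.conj(np.vdot(a, d))           # rescale so that a^H d = 1
    wng_other.append(white_noise_gain(a, d))

print(wng_das, min(wng_other))
```

By the Cauchy-Schwarz inequality no weight vector can fall below the delay-and-sum value of $1/M$, which is consistent with the statement that this beamformer minimizes the noise sensitivity.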






1.2 Quadratic Inequality Constraint (QIC-GSC) 

A common approach to increase the robustness of the GSC is to apply a Quadratic Inequality Constraint
(QIC) [9]-[14, 19] to the ANC filters $\mathbf{w}_{1:M-1}$, so that the optimization criterion (6) of the GSC is modified
into

$$\mathbf{w}_{1:M-1} = \arg\min_{\mathbf{w}_{1:M-1}} \mathcal{E}\{\,|y_0^n[k-\Delta] - \mathbf{w}_{1:M-1}^H \mathbf{y}_{1:M-1}^n[k]|^2\,\} \quad \text{subject to} \quad \mathbf{w}_{1:M-1}^H \mathbf{w}_{1:M-1} \le \beta^2. \tag{11}$$

The QIC avoids excessive growth of the filter coefficients $\mathbf{w}_{1:M-1}$. Hence, it reduces the undesired speech
distortion when speech leaks into the noise references. In [14, 19], it is shown that, for a GSC with a blocking
matrix $\mathbf{B}(f)$ that satisfies $\mathbf{B}^H(f)\mathbf{B}(f) = \mathbf{I}$, the QIC on the ANC filters corresponds to a constraint on the
noise sensitivity.

In [14], the QIC-GSC is implemented by using the adaptive scaled projection algorithm: at each update
step, the quadratic constraint is applied to the newly obtained ANC filter by scaling the filter coefficients
by $\beta / \|\mathbf{w}_{1:M-1}\|$ when $\mathbf{w}_{1:M-1}^H \mathbf{w}_{1:M-1}$ exceeds $\beta^2$. Although this technique works well for LMS updating, it
does not appear to be as effective for RLS [19]. Recently, Tian et al. implemented the quadratic
constraint by using variable loading [19]. For RLS, this technique provides a better approximation to the
optimal solution (11) than the scaled projection algorithm. For LMS, variable loading does not appear to
offer any performance advantage over the cheaper, scaled projection LMS.
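The scaled projection step can be sketched in a few lines. The example below assumes a normalized LMS (NLMS) update for the ANC and a single noise reference observed during a noise-only period; the step size, the value of `beta`, the noise path `h` and the function names `spa_nlms_step` and `run` are illustrative choices and are not taken from [14].

```python
import numpy as np

def spa_nlms_step(w, y, d, step, beta, eps=1e-8):
    """One NLMS update of the ANC filter w, followed by the scaled projection.

    y: stacked noise-reference vector, d: delayed speech-reference sample.
    """
    e = d - np.vdot(w, y)                        # output z[k] = d - w^H y
    w = w + step * y * np.conj(e) / (np.vdot(y, y).real + eps)
    norm = np.linalg.norm(w)
    if norm > beta:                              # w^H w exceeds beta^2
        w = w * (beta / norm)                    # scale back onto the constraint
    return w, e

def run(x, d, L, step, beta):
    """Run the constrained update over a whole noise-only segment."""
    w = np.zeros(L)
    max_norm = 0.0
    for k in range(L - 1, len(x)):
        y = x[k - L + 1:k + 1][::-1]             # [x[k], x[k-1], ..., x[k-L+1]]
        w, _ = spa_nlms_step(w, y, d[k], step, beta)
        max_norm = max(max_norm, float(np.linalg.norm(w)))
    return w, max_norm

rng = np.random.default_rng(3)
x = rng.standard_normal(3000)                    # noise reference
h = np.array([0.8, -0.3])                        # hypothetical noise path, norm ~0.85
d = np.convolve(x, h)[:len(x)]                   # noise in the speech reference

w_free, _ = run(x, d, L=4, step=0.5, beta=np.inf)    # unconstrained ANC
w_qic, max_norm = run(x, d, L=4, step=0.5, beta=0.5) # QIC with beta = 0.5
```

With the constraint inactive the filter converges to the noise path; with `beta = 0.5` the filter norm never exceeds the bound, which illustrates why the QIC limits the attainable noise reduction.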

2 Multi-channel Wiener filtering (MWF) 
2.1 Concept 

Recently, a Multi-channel Wiener filtering (MWF) technique has been proposed that provides a Minimum 
Mean Square Error (MMSE) estimate of the desired signal portion in one of the received microphone signals 
[21, 22, 23, 24]. In contrast to the GSC, this filtering technique does not make any a priori assumptions about
the signal model and is found to be more robust [16, 17, 21]. Especially in complicated noise scenarios such 
as multiple noise sources or diffuse noise, the MWF outperforms the GSC, even when the GSC is supplied 
with a robustness constraint [17]. 

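The MWF $\mathbf{w}_{1:M} \in \mathbb{C}^{ML \times 1}$ minimizes the Mean Square Error (MSE) between a delayed version of the
(unknown) speech signal $u_i^s[k-\Delta]$ at the $i$-th (e.g., first) microphone and the sum $\mathbf{w}_{1:M}^H \mathbf{u}_{1:M}[k]$ of the $M$
filtered, received microphone signals:

$$\mathbf{w}_{1:M} = \arg\min_{\mathbf{w}_{1:M}} \mathcal{E}\{\,|u_i^s[k-\Delta] - \mathbf{w}_{1:M}^H \mathbf{u}_{1:M}[k]|^2\,\}, \tag{12}$$

leading to:

$$\mathbf{w}_{1:M} = \mathcal{E}\{\mathbf{u}_{1:M}[k]\,\mathbf{u}_{1:M}^H[k]\}^{-1}\, \mathcal{E}\{\mathbf{u}_{1:M}[k]\, u_i^{s,*}[k-\Delta]\}, \tag{13}$$

with

$$\mathbf{w}_{1:M}^H = [\,\mathbf{w}_1^H \;\; \mathbf{w}_2^H \;\; \cdots \;\; \mathbf{w}_M^H\,], \tag{14}$$

$$\mathbf{u}_{1:M}[k] = [\,\mathbf{u}_1^H[k] \;\; \mathbf{u}_2^H[k] \;\; \cdots \;\; \mathbf{u}_M^H[k]\,]^H, \tag{15}$$

$$\mathbf{u}_i[k] = [\,u_i[k] \;\; u_i[k-1] \;\; \cdots \;\; u_i[k-L+1]\,]^T. \tag{16}$$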


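An equivalent approach consists in estimating a delayed version of the (unknown) noise signal $u_i^n[k-\Delta]$
in the $i$-th microphone, resulting in the noise-estimating filter

$$\bar{\mathbf{w}}_{1:M} = \arg\min_{\bar{\mathbf{w}}_{1:M}} \mathcal{E}\{\,|u_i^n[k-\Delta] - \bar{\mathbf{w}}_{1:M}^H \mathbf{u}_{1:M}[k]|^2\,\}, \tag{17}$$

and

$$\bar{\mathbf{w}}_{1:M} = \mathcal{E}\{\mathbf{u}_{1:M}[k]\,\mathbf{u}_{1:M}^H[k]\}^{-1}\, \mathcal{E}\{\mathbf{u}_{1:M}[k]\, u_i^{n,*}[k-\Delta]\}, \tag{18}$$

where

$$\bar{\mathbf{w}}_{1:M}^H = [\,\bar{\mathbf{w}}_1^H \;\; \bar{\mathbf{w}}_2^H \;\; \cdots \;\; \bar{\mathbf{w}}_M^H\,]. \tag{19}$$

The estimate of the speech component $u_i^s[k-\Delta]$ is then obtained by subtracting the estimate $\hat{u}_i^n[k-\Delta] =
\bar{\mathbf{w}}_{1:M}^H \mathbf{u}_{1:M}[k]$ from the delayed, $i$-th microphone signal $u_i[k-\Delta]$, i.e.,

$$\hat{u}_i^s[k-\Delta] = u_i[k-\Delta] - \bar{\mathbf{w}}_{1:M}^H \mathbf{u}_{1:M}[k]. \tag{20}$$

This is depicted in Figure 2 for $u_i^n[k-\Delta] = u_1^n[k-\Delta]$. Using (13) and (18), it can be easily shown that

$$\mathbf{w}_{1:M} = \mathbf{e}_{(i-1)L+\Delta} - \bar{\mathbf{w}}_{1:M}, \tag{21}$$

with $\mathbf{e}_l$ the $l$-th canonical vector, defined as

$$\mathbf{e}_l = [\,0 \;\; \cdots \;\; 0 \;\; \underbrace{1}_{\text{position } l} \;\; 0 \;\; \cdots \;\; 0\,]^T. \tag{22}$$

This shows that the two approaches indeed lead to exactly the same speech signal estimate. A procedure for
computing $\mathbf{w}_{1:M}$ or $\bar{\mathbf{w}}_{1:M}$ will be given in Section 2.3.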
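The equivalence of the speech-estimating and noise-estimating approaches can be verified numerically. The sketch below uses white random sequences as stand-ins for the speech and noise components and replaces expectations by sample averages; the helper names and all signal parameters are assumptions for illustration only. The code uses 0-based indices, so `mic` corresponds to $i - 1$.

```python
import numpy as np

def tap_matrix(u, L):
    """Return an (N, M*L) matrix whose row k is the stacked vector
    [u_1[k], ..., u_1[k-L+1], ..., u_M[k], ..., u_M[k-L+1]] (zeros before time 0)."""
    u = np.atleast_2d(np.asarray(u))
    M, N = u.shape
    U = np.zeros((N, M * L), dtype=u.dtype)
    for m in range(M):
        for d in range(L):
            U[d:, m * L + d] = u[m, :N - d]
    return U

def delayed(x, delay):
    """Return x[k - delay], with zeros for k < delay."""
    out = np.zeros_like(x)
    out[delay:] = x[:len(x) - delay]
    return out

rng = np.random.default_rng(4)
M, L, N, delay, mic = 2, 4, 5000, 2, 0
s = rng.standard_normal((M, N))                 # stand-in speech components
n = rng.standard_normal((M, N))                 # stand-in noise components
U = tap_matrix(s + n, L)

R = U.T @ U / N                                 # estimate of E{u u^H}
r_s = U.T @ delayed(s[mic], delay) / N          # E{u u_i^{s,*}[k - delay]}
r_n = U.T @ delayed(n[mic], delay) / N          # E{u u_i^{n,*}[k - delay]}

w_speech = np.linalg.solve(R, r_s)              # speech-estimating filter, cf. (13)
w_noise = np.linalg.solve(R, r_n)               # noise-estimating filter, cf. (18)

# Canonical vector selecting u_i[k - delay] in the stacked input (0-based index).
e = np.zeros(M * L)
e[mic * L + delay] = 1.0

speech_direct = U @ w_speech                                   # w^H u[k]
speech_via_noise = delayed((s + n)[mic], delay) - U @ w_noise  # as in (20)
```

The two filters add up to the canonical vector and therefore give the same speech estimate, sample by sample.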
2.2 Trade-off speech distortion versus noise reduction (SDW-MWF)

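The residual error energy equals

$$\mathcal{E}\{|e[k]|^2\} = \mathcal{E}\{\,|u_i^s[k-\Delta] - \mathbf{w}_{1:M}^H \mathbf{u}_{1:M}[k]|^2\,\}, \tag{23}$$

and can be decomposed as

$$\underbrace{\mathcal{E}\{\,|u_i^s[k-\Delta] - \mathbf{w}_{1:M}^H \mathbf{u}_{1:M}^s[k]|^2\,\}}_{\epsilon_d^2} + \underbrace{\mathcal{E}\{\,|\mathbf{w}_{1:M}^H \mathbf{u}_{1:M}^n[k]|^2\,\}}_{\epsilon_n^2}, \tag{24}$$

where $\epsilon_d^2$ equals the speech distortion energy and $\epsilon_n^2$ the residual noise energy. The design criterion of
the MWF can be generalized to allow for a trade-off between speech distortion and noise reduction, by
incorporating a weighting factor $\mu$ [20] with $\mu \in [0, \infty]$:

$$\mathbf{w}_{1:M} = \arg\min_{\mathbf{w}_{1:M}} \mathcal{E}\{\,|u_i^s[k-\Delta] - \mathbf{w}_{1:M}^H \mathbf{u}_{1:M}^s[k]|^2\,\} + \mu\, \mathcal{E}\{\,|\mathbf{w}_{1:M}^H \mathbf{u}_{1:M}^n[k]|^2\,\}. \tag{25}$$

The solution of (25) is given by

$$\mathbf{w}_{1:M} = \mathcal{E}\{\mathbf{u}_{1:M}^s[k]\,\mathbf{u}_{1:M}^{s,H}[k] + \mu\, \mathbf{u}_{1:M}^n[k]\,\mathbf{u}_{1:M}^{n,H}[k]\}^{-1}\, \mathcal{E}\{\mathbf{u}_{1:M}^s[k]\, u_i^{s,*}[k-\Delta]\}, \tag{26}$$

which corresponds to the Wiener formula with an adjustable input noise level. Note that (13) is obtained
with $\mu = 1$ and that (21) still applies. The filter (26) corresponds to the time-domain constrained estimator
proposed in [32], which optimizes the following criterion:

$$\min_{\mathbf{w}_{1:M}} \epsilon_d^2 \quad \text{subject to} \quad \epsilon_n^2 \le \alpha\, \mathcal{E}\{\mathbf{u}_{1:M}^{n,H}[k]\,\mathbf{u}_{1:M}^n[k]\}, \tag{27}$$

where $0 < \alpha < 1$ and $\mu$ is the Lagrange multiplier.

Equivalently, the optimization criterion for the noise-estimating filter $\bar{\mathbf{w}}_{1:M}$ in (17) can be modified into

$$\bar{\mathbf{w}}_{1:M} = \arg\min_{\bar{\mathbf{w}}_{1:M}} \mathcal{E}\{\,|\bar{\mathbf{w}}_{1:M}^H \mathbf{u}_{1:M}^s[k]|^2\,\} + \mu\, \mathcal{E}\{\,|u_i^n[k-\Delta] - \bar{\mathbf{w}}_{1:M}^H \mathbf{u}_{1:M}^n[k]|^2\,\}, \tag{28}$$

resulting in

$$\bar{\mathbf{w}}_{1:M} = \mathcal{E}\{\mathbf{u}_{1:M}^n[k]\,\mathbf{u}_{1:M}^{n,H}[k] + \frac{1}{\mu}\, \mathbf{u}_{1:M}^s[k]\,\mathbf{u}_{1:M}^{s,H}[k]\}^{-1}\, \mathcal{E}\{\mathbf{u}_{1:M}^n[k]\, u_i^{n,*}[k-\Delta]\}. \tag{29}$$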



In the sequel, we will refer to (29) as the Speech Distortion Weighted Multi-channel Wiener Filter (SDW- 
MWF). 
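The effect of the weighting factor can be illustrated with a small numerical sketch. It builds made-up speech and noise correlation matrices, evaluates the speech-estimating filter (26) and the noise-estimating filter (29) for a few values of $\mu$, and reports the speech distortion energy and residual noise energy. The matrix sizes, the random matrices and the function names are assumptions for illustration; the cross-correlation vectors are taken as column `idx` of the correlation matrices, which holds when the desired sample is one element of the stacked vector.

```python
import numpy as np

def sdw_mwf_speech(Rs, Rn, mu, idx):
    """Speech-estimating filter, following (26)."""
    return np.linalg.solve(Rs + mu * Rn, Rs[:, idx])

def sdw_mwf_noise(Rs, Rn, mu, idx):
    """Noise-estimating filter, following (29); requires mu > 0."""
    return np.linalg.solve(Rn + Rs / mu, Rn[:, idx])

def energies(w, Rs, Rn, idx):
    """Speech distortion and residual noise energy of a speech-estimating filter w."""
    e = np.zeros(len(w))
    e[idx] = 1.0
    eps_d = float(np.real(np.conj(e - w) @ Rs @ (e - w)))
    eps_n = float(np.real(np.conj(w) @ Rn @ w))
    return eps_d, eps_n

rng = np.random.default_rng(5)
dim, idx = 6, 2
A = rng.standard_normal((dim, 2))
B = rng.standard_normal((dim, dim))
Rs = A @ A.T                          # made-up low-rank speech correlation matrix
Rn = B @ B.T + 0.1 * np.eye(dim)      # made-up full-rank noise correlation matrix

results = []
for mu in [0.1, 1.0, 10.0]:
    w = sdw_mwf_speech(Rs, Rn, mu, idx)
    w_bar = sdw_mwf_noise(Rs, Rn, mu, idx)
    eps_d, eps_n = energies(w, Rs, Rn, idx)
    results.append((mu, eps_d, eps_n, w + w_bar))
    print(f"mu={mu:5.1f}  speech distortion={eps_d:.4f}  residual noise={eps_n:.4f}")
```

Increasing $\mu$ lowers the residual noise energy at the price of a higher speech distortion energy, and for every $\mu$ the two filters still add up to the canonical vector.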

The factor $\mu \in [0, \infty]$ trades off speech distortion versus noise reduction. If $\mu = 1$, the MMSE criterion
(12) or (17) is obtained. If $\mu > 1$, the residual noise level will be reduced at the expense of increased speech
distortion. By setting $\mu$ to $\infty$, all emphasis is put on noise reduction and speech distortion is completely
ignored. This results in $\mathbf{w}_{1:M} = \mathbf{0}$ for the speech-estimating filter (26) or $\bar{\mathbf{w}}_{1:M} = \mathbf{e}_{(i-1)L+\Delta}$ for the
noise-estimating filter (29), which means that the output signal equals 0. Setting $\mu$ to 0, on the other hand, results in
$\mathbf{w}_{1:M} = \mathbf{e}_{(i-1)L+\Delta}$ or $\bar{\mathbf{w}}_{1:M} = \mathbf{0}$ and hence in no noise reduction.

2.3 Implementation of MWF 

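In practice, the correlation matrix $\mathcal{E}\{\mathbf{u}_{1:M}^s[k]\,\mathbf{u}_{1:M}^{s,H}[k]\}$ is unknown. During periods of speech, the inputs
$u_i[k]$ consist of speech + noise, i.e., $u_i[k] = u_i^s[k] + u_i^n[k]$, $i = 1, \ldots, M$. During periods of noise,
only the noise component $u_i^n[k]$ is observed. Assuming that the speech and noise signal are uncorrelated,
$\mathcal{E}\{\mathbf{u}_{1:M}^s[k]\,\mathbf{u}_{1:M}^{s,H}[k]\}$ can be estimated as

$$\mathcal{E}\{\mathbf{u}_{1:M}^s[k]\,\mathbf{u}_{1:M}^{s,H}[k]\} = \mathcal{E}\{\mathbf{u}_{1:M}[k]\,\mathbf{u}_{1:M}^H[k]\} - \mathcal{E}\{\mathbf{u}_{1:M}^n[k]\,\mathbf{u}_{1:M}^{n,H}[k]\}, \tag{30}$$

where the second order statistics $\mathcal{E}\{\mathbf{u}_{1:M}[k]\,\mathbf{u}_{1:M}^H[k]\}$ are estimated during speech + noise and the statistics
$\mathcal{E}\{\mathbf{u}_{1:M}^n[k]\,\mathbf{u}_{1:M}^{n,H}[k]\}$ during periods of noise only. As for the GSC, a robust speech detection is thus needed.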

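Using (30), (29) and (26) can be re-written as:

$$\bar{\mathbf{w}}_{1:M} = \left( \frac{1}{\mu}\, \mathcal{E}\{\mathbf{u}_{1:M}[k]\,\mathbf{u}_{1:M}^H[k]\} + \left(1 - \frac{1}{\mu}\right) \mathcal{E}\{\mathbf{u}_{1:M}^n[k]\,\mathbf{u}_{1:M}^{n,H}[k]\} \right)^{-1} \mathcal{E}\{\mathbf{u}_{1:M}^n[k]\, u_i^{n,*}[k-\Delta]\} \tag{31}$$

and

$$\mathbf{w}_{1:M} = \left( \mathcal{E}\{\mathbf{u}_{1:M}[k]\,\mathbf{u}_{1:M}^H[k]\} + (\mu - 1)\, \mathcal{E}\{\mathbf{u}_{1:M}^n[k]\,\mathbf{u}_{1:M}^{n,H}[k]\} \right)^{-1} \left( \mathcal{E}\{\mathbf{u}_{1:M}[k]\, u_i^{*}[k-\Delta]\} - \mathcal{E}\{\mathbf{u}_{1:M}^n[k]\, u_i^{n,*}[k-\Delta]\} \right). \tag{32}$$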
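A minimal sketch of this estimation procedure is given below. It assumes an ideal speech detector (the speech + noise frames and the noise-only frames are simply generated separately), one tap per channel ($L = 1$) and zero delay, so that the cross-correlation vectors in (32) are columns of the correlation matrices. The gain vector `steer`, the noise correlation matrix and the function name `sdw_mwf_from_observables` are made up for the example.

```python
import numpy as np

def sdw_mwf_from_observables(Ruu, Rnn, mu, idx):
    """Speech-estimating SDW-MWF as in (32), using only observable statistics.

    Ruu: correlation matrix estimated during speech + noise.
    Rnn: correlation matrix estimated during noise only.
    With L = 1 and zero delay, the cross-correlation vectors are column idx.
    """
    return np.linalg.solve(Ruu + (mu - 1.0) * Rnn, Ruu[:, idx] - Rnn[:, idx])

rng = np.random.default_rng(6)
M, N = 3, 200_000
steer = np.array([1.0, 0.9, 0.8])               # made-up speech gain per microphone
Rn_true = 0.5 * np.eye(M) + 0.2                 # made-up noise correlation matrix
Rs_true = np.outer(steer, steer)                # unit-variance speech source

C = np.linalg.cholesky(Rn_true)
noise_a = C @ rng.standard_normal((M, N))       # noise during speech + noise frames
noise_b = C @ rng.standard_normal((M, N))       # noise during noise-only frames
speech = np.outer(steer, rng.standard_normal(N))

u_sn = speech + noise_a                         # frames flagged "speech + noise"
u_n = noise_b                                   # frames flagged "noise only"

Ruu = u_sn @ u_sn.T / N
Rnn = u_n @ u_n.T / N
Rs_est = Ruu - Rnn                              # estimate following (30)

mu = 2.0
w_est = sdw_mwf_from_observables(Ruu, Rnn, mu, idx=0)
w_oracle = np.linalg.solve(Rs_true + mu * Rn_true, Rs_true[:, 0])
print(np.round(w_est, 3), np.round(w_oracle, 3))
```

With enough data the filter computed from observable statistics approaches the filter computed from the true speech and noise statistics; in practice the quality of the speech detection limits how well this holds.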



In [21], the Wiener filter is computed at each time instant $k$ by means of a Generalized Singular Value
Decomposition (GSVD) of a speech + noise and a noise data matrix. A cheaper recursive alternative based on
a QR-decomposition has been proposed in [22]. In [23, 24], a subband implementation has been developed
to increase intelligibility and reduce complexity, making it suitable for hearing aid applications.

Finally, note that instead of estimating $\mathcal{E}\{\mathbf{u}_{1:M}^s[k]\,\mathbf{u}_{1:M}^{s,H}[k]\}$ using (30), a pre-determined estimate
of $\mathcal{E}\{\mathbf{u}_{1:M}^s[k]\,\mathbf{u}_{1:M}^{s,H}[k]\}$ is sometimes used [25, 33]. In [25], this estimate is derived from clean speech
recordings measured during an initial calibration phase. Additional recordings of the source speech signal
make it possible to produce an estimate of the non-reverberant source speech signal instead of an estimate of the
reverberant speech component in one of the microphone signals. However, since the room acoustics, the
position of the desired speaker and the microphone characteristics may change over time, frequent re-calibration
is required. In [33], a mathematical estimate of the correlation matrix and the correlation vector of the
non-reverberant speech is exploited in which some signal model errors are taken into account.
In this Section, the present invention is described in detail.

In Section 3, the proposed adaptive multi-channel noise reduction technique, referred to as Spatially 
Pre-processed Speech Distortion Weighted Multi-channel Wiener filter, is described. 

Section 3.2 describes a first embodiment, referred to as Speech Distortion Regularized GSC (SDR-GSC).
A new design criterion is developed for the adaptive stage of the GSC: the ANC design criterion is
supplemented with a regularization term that limits speech distortion due to signal model errors. In the
SDR-GSC, a parameter $\mu$ is incorporated that allows for a trade-off between speech distortion and noise reduction.
Focussing all attention on noise reduction results in the standard GSC, while, on the other hand, focussing
all attention towards speech distortion results in the output of the fixed beamformer. In noise scenarios with
low SNR, adaptivity in the SDR-GSC can be easily reduced or excluded by increasing attention towards
speech distortion, i.e., by decreasing the parameter $\mu$ to 0. The SDR-GSC is an alternative technique to
the QIC-GSC to decrease the sensitivity of the GSC to signal model errors such as microphone mismatch
and reverberation. In contrast to the QIC-GSC, the SDR-GSC shifts emphasis towards speech distortion
when the amount of speech leakage grows. In the absence of signal model errors, the performance of the
GSC is preserved. As a result, a better noise reduction performance is obtained for small model errors, while
guaranteeing robustness against large model errors.

In a second embodiment, described in Section 3.3, we further improve the noise reduction performance
of the SDR-GSC by adding an extra adaptive filtering operation $\mathbf{w}_0$ on the speech reference signal. We refer
to this generalized scheme as Spatially Pre-processed Speech Distortion Weighted Multi-channel Wiener
Filter (SP-SDW-MWF). The SP-SDW-MWF is depicted in Figure 3 and encompasses the MWF as a special
case. Again, a parameter $\mu$ is incorporated in the design criterion to allow for a trade-off between speech
distortion and noise reduction. Focussing all attention on speech distortion results in the output of the fixed
beamformer. Also here, adaptivity can be easily reduced or excluded by decreasing $\mu$ to 0. It is shown
that, in the absence of speech leakage and for infinitely long filter lengths, the SP-SDW-MWF corresponds
to a cascade of an SDR-GSC with an SDW-SWF postfilter. In the presence of speech leakage, the
SP-SDW-MWF with $\mathbf{w}_0$ tries to preserve its performance: compared to an SDR-GSC with SDW-SWF postfilter, the
SP-SDW-MWF then contains extra filtering operations that compensate for the performance degradation of
the SDR-GSC with SDW-SWF due to speech leakage. In contrast to the SDR-GSC (and thus also the GSC),
performance does not degrade due to microphone mismatch. In [22, 27], recursive implementations of the
(SDW-)MWF have been proposed based on a GSVD or QR decomposition. A subband implementation [28]
results in improved intelligibility at a significantly lower cost compared to the fullband approach. These
techniques³ can be extended to implement the SDR-GSC and, more generally, the SP-SDW-MWF.

³ The implementation based on GSVD can only be used for the SP-SDW-MWF with filter $\mathbf{w}_0$.

In a third embodiment, described in Section 4, we propose cheap time-domain and frequency-domain
stochastic gradient implementations of the SDR-GSC and SP-SDW-MWF. Starting from the design criterion
of the SDR-GSC, or more generally, the SP-SDW-MWF, we derive a time-domain stochastic gradient
algorithm. In addition, we modify the LMS based algorithm [26] so that it applies to the SP-SDW-MWF. To
increase convergence speed and reduce complexity, a frequency-domain implementation has been proposed. Both
the stochastic gradient and the LMS based algorithm suffer from a large excess error when applied in highly
time-varying noise scenarios. We show that the excess error in the stochastic gradient algorithm is reduced
by applying a low pass filter to the part of the gradient estimate that limits speech distortion. The low
pass filtering avoids a highly time-varying distortion of the desired speech component while not degrading
the tracking performance needed in time-varying noise scenarios. The stochastic gradient SP-SDW-MWF
outperforms the LMS based algorithm, while complexity is not increased. Experimental results show that
the low pass filtering significantly improves the performance of the stochastic gradient algorithm and does
not compromise the tracking of changes in the noise scenario. In addition, experiments demonstrate that
the proposed stochastic gradient algorithm preserves the benefit of the SP-SDW-MWF over QIC-GSC. The
limited computational cost and the better noise reduction performance of the proposed algorithm make it a
good alternative to the SPA [14] for implementation in hearing aids.

3 Spatially pre-processed SDW Multi-channel Wiener filter 
3.1 Concept

Figure 3 describes the Spatially pre-processed, Speech Distortion Weighted Multi-channel Wiener filter
(SP-SDW-MWF). The SP-SDW-MWF consists of a fixed, spatial pre-processor, i.e., a fixed beamformer
$A(z)$ and a blocking matrix $B(z)$, and an adaptive Speech Distortion Weighted Multi-channel Wiener filter
(SDW-MWF). Given $M$ microphone signals

$$u_i[k] = u_i^s[k] + u_i^n[k], \quad i = 1, \ldots, M, \tag{33}$$

with $u_i^s[k]$ the desired speech contribution and $u_i^n[k]$ the noise contribution, the fixed beamformer $A(z)$
creates a so-called speech reference

$$y_0[k] = y_0^s[k] + y_0^n[k], \tag{34}$$

by steering a beam towards the direction of the desired signal, with $y_0^s[k]$ the speech contribution and
$y_0^n[k]$ the noise contribution. In the sequel, an endfire array is assumed and the desired speaker is assumed
to be in front at 0°. To preserve the robustness advantage of the MWF, the fixed beamformer $A(z)$ should be
designed so that the distortion in the speech reference $y_0^s[k]$ is minimal for all possible errors in the
assumed signal model, such as microphone mismatch. In the sequel, a delay-and-sum beamformer is used. For
small-sized arrays, this beamformer offers sufficient robustness against signal model errors as it minimizes
the white noise gain or noise sensitivity$^4$. Given statistical knowledge about the signal model errors that
occur in practice, a further optimized beamformer $A(z)$ can be designed, e.g., using the techniques in [31].

The blocking matrix $B(z)$ creates $M-1$ so-called noise references

$$y_i[k] = y_i^s[k] + y_i^n[k], \quad i = 1, \ldots, M-1, \tag{35}$$

by steering zeroes towards the front so that the noise contributions $y_i^n[k]$ are dominant compared to the
speech leakage contributions $y_i^s[k]$. A simple technique to create the noise references consists of pairwise
subtracting the microphone signals after they have been time-aligned for 0°. Using [31, 34], further optimized
noise references can be created. Speech leakage can then be minimized for a specified angular region around 0°
instead of for 0° only, e.g., for an angular region from -20° to 20°. In addition, given statistical knowledge
about the signal model errors that occur in practice, speech leakage can be minimized for all possible model
errors by using [31].

4 The white noise gain or noise sensitivity is defined as the ratio of the spatially white noise gain to the gain of the desired signal
and is often used to quantify the sensitivity of an algorithm against errors in the assumed signal model [2, 14].

In the sequel, the superscripts $s$ and $n$ are used to refer to the speech and noise contribution of a signal.
During periods of speech + noise, the references $y_i[k]$, $i = 0, \ldots, M-1$, contain speech + noise. During
periods of noise only, $y_i[k]$, $i = 0, \ldots, M-1$, only consist of a noise component, i.e., $y_i[k] = y_i^n[k]$. The
second order statistics of the noise signal are assumed to be quite stationary such that they can be estimated
during periods of noise only.
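As an illustration of the pre-processor described above, the following Python sketch forms a speech reference with a delay-and-sum beamformer and builds the $M-1$ noise references by pairwise subtraction. It is a minimal sketch under simplifying assumptions: the microphone signals are taken to be already time-aligned for 0° and perfectly matched, and the function name `spatial_preprocess` and the toy signals are hypothetical, not part of the described system.

```python
def spatial_preprocess(mics):
    """Return (speech_ref, noise_refs) for time-aligned microphone signals.

    mics: list of M equally long lists of samples, assumed to be already
    time-aligned for the 0 degree look direction (assumption of this sketch).
    """
    num_mics = len(mics)
    num_samples = len(mics[0])
    # Delay-and-sum beamformer: average of the aligned microphone signals.
    speech_ref = [sum(m[k] for m in mics) / num_mics for k in range(num_samples)]
    # Blocking matrix: pairwise differences of adjacent aligned microphones.
    noise_refs = [
        [mics[i][k] - mics[i + 1][k] for k in range(num_samples)]
        for i in range(num_mics - 1)
    ]
    return speech_ref, noise_refs


# Toy example: identical speech on all three microphones, different noise.
speech = [1.0, -2.0, 0.5, 3.0]
noise = [[0.1, 0.0, -0.1, 0.2], [0.0, 0.2, 0.1, -0.1], [-0.2, 0.1, 0.0, 0.1]]
mics = [[s + n for s, n in zip(speech, ch)] for ch in noise]
speech_ref, noise_refs = spatial_preprocess(mics)
```

With matched microphones the speech cancels in every noise reference. If one microphone signal is scaled by a gain error, the pairwise difference no longer cancels the speech, which is the speech leakage that the adaptive stage has to cope with.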

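The SDW-MWF filter$^5$ $\mathbf{w}_{0:M-1}$

$$\mathbf{w}_{0:M-1} = \left( \frac{1}{\mu} E\{\mathbf{y}^s_{0:M-1}[k]\,\mathbf{y}^{s,H}_{0:M-1}[k]\} + E\{\mathbf{y}^n_{0:M-1}[k]\,\mathbf{y}^{n,H}_{0:M-1}[k]\} \right)^{-1} E\{\mathbf{y}^n_{0:M-1}[k]\, y_0^{n,*}[k-\Delta]\} \tag{36}$$

with

$$\mathbf{w}^H_{0:M-1}[k] = \begin{bmatrix} \mathbf{w}_0^H[k] & \mathbf{w}_1^H[k] & \cdots & \mathbf{w}_{M-1}^H[k] \end{bmatrix}, \tag{37}$$

$$\mathbf{w}_i[k] = \begin{bmatrix} w_i[0] & w_i[1] & \cdots & w_i[L-1] \end{bmatrix}^T, \tag{38}$$

$$\mathbf{y}^H_{0:M-1}[k] = \begin{bmatrix} \mathbf{y}_0^H[k] & \mathbf{y}_1^H[k] & \cdots & \mathbf{y}_{M-1}^H[k] \end{bmatrix}, \tag{39}$$

$$\mathbf{y}_i[k] = \begin{bmatrix} y_i[k] & y_i[k-1] & \cdots & y_i[k-L+1] \end{bmatrix}^T, \tag{40}$$

provides an estimate $\mathbf{w}^H_{0:M-1}\mathbf{y}_{0:M-1}[k]$ of the noise contribution $y_0^n[k-\Delta]$$^6$ in the speech reference by
minimizing the cost function $J(\mathbf{w}_{0:M-1})$

$$J(\mathbf{w}_{0:M-1}) = \frac{1}{\mu} \underbrace{E\{ |\mathbf{w}^H_{0:M-1}\mathbf{y}^s_{0:M-1}[k]|^2 \}}_{\varepsilon_d^2} + \underbrace{E\{ |y_0^n[k-\Delta] - \mathbf{w}^H_{0:M-1}\mathbf{y}^n_{0:M-1}[k]|^2 \}}_{\varepsilon_n^2}. \tag{41}$$

5 In a time-domain implementation, the input signals of the adaptive filter and the filter $\mathbf{w}_{0:M-1}$ are real and hence
$\mathbf{w}^H_{0:M-1} = \mathbf{w}^T_{0:M-1}$. In the sequel, the formulas are generalized to complex input signals so that they can also be applied
to a subband implementation.

6 The delay $\Delta$ is applied to the speech reference to allow for non-causal taps in the filter $\mathbf{w}$. Usually, it is set to
$\lceil L/2 \rceil$, where $\lceil x \rceil$ returns the smallest integer equal to or larger than $x$.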



The subscript $0:M-1$ in $\mathbf{w}_{0:M-1}$ and $\mathbf{y}_{0:M-1}$ refers to the subscripts of the first and last channel
component of the adaptive filter and input vector, respectively. The term $\varepsilon_d^2$ represents the speech distortion
energy and $\varepsilon_n^2$ the residual noise energy. The term $\frac{1}{\mu}\varepsilon_d^2$ in the cost function (41) limits the possible amount
of speech distortion at the output of the SP-SDW-MWF. Hence, the SP-SDW-MWF adds robustness against
signal model errors to the GSC by taking speech distortion explicitly into account in the design criterion of
the adaptive stage. The parameter $\frac{1}{\mu} \in [0, \infty)$ trades off between noise reduction and speech distortion: the
larger $1/\mu$, the smaller the amount of possible speech distortion. For $\mu = 0$, the output of the fixed beamformer
$A(z)$, delayed by $\Delta$ samples, is obtained. In noise scenarios with very low Signal-to-Noise Ratio (SNR),
e.g., -10 dB, a fixed beamformer may be preferred. Adaptivity can be easily reduced or excluded in the
SP-SDW-MWF by decreasing $\mu$ to 0. Alternatively, adaptivity can be limited by applying a QIC to $\mathbf{w}_{0:M-1}$.
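Note that when the fixed beamformer $A(z)$ and the blocking matrix $B(z)$ are set to

$$A(z) = \begin{bmatrix} 1 & 0 & \cdots & 0 \end{bmatrix}^H, \tag{42}$$

$$B(z) = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix}^H, \tag{43}$$

we obtain the original SDW-MWF that operates on the received microphone signals $u_i[k]$, $i = 1, \ldots, M$.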

Below, the different parameter settings of the SP-SDW-MWF are discussed. Depending on the setting of
the parameter $\mu$ and the presence or absence of the filter $\mathbf{w}_0$, the GSC, the (SDW-)MWF as well as in-between
solutions such as the Speech Distortion Regularized GSC (SDR-GSC) may be obtained. We distinguish
between two cases, i.e., the case where no filter $\mathbf{w}_0$ is applied to the speech reference (filter length $L_0 = 0$)
and the case where an additional filter $\mathbf{w}_0$ is used ($L_0 \neq 0$).

The adaptive stage of the SP-SDW-MWF can be implemented using the recursive QRD-based implementation
of the SDW-MWF [22]. As for the SDW-MWF, complexity can be reduced by a subband
implementation [23]. For $L_0 \neq 0$, the GSVD based algorithm [20] can also be applied. Cheaper stochastic
gradient based algorithms are proposed in Section 4.



3.2 First embodiment: SDR-GSC, i.e., SP-SDW-MWF without $\mathbf{w}_0$

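First, consider the case without $\mathbf{w}_0$, i.e., $L_0 = 0$. The solution for $\mathbf{w}_{1:M-1}$ in (36) then reduces to

$$\arg\min_{\mathbf{w}_{1:M-1}} \; \frac{1}{\mu} \underbrace{E\{ |\mathbf{w}^H_{1:M-1}\mathbf{y}^s_{1:M-1}[k]|^2 \}}_{\varepsilon_d^2} + \underbrace{E\{ |y_0^n[k-\Delta] - \mathbf{w}^H_{1:M-1}\mathbf{y}^n_{1:M-1}[k]|^2 \}}_{\varepsilon_n^2}, \tag{44}$$

leading to

$$\mathbf{w}_{1:M-1} = \left( \frac{1}{\mu} E\{\mathbf{y}^s_{1:M-1}\mathbf{y}^{s,H}_{1:M-1}\} + E\{\mathbf{y}^n_{1:M-1}\mathbf{y}^{n,H}_{1:M-1}\} \right)^{-1} E\{\mathbf{y}^n_{1:M-1}\, y_0^{n,*}[k-\Delta]\},$$

where $\varepsilon_d^2$ is the speech distortion energy and $\varepsilon_n^2$ the residual noise energy.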

Remark: For Lq = ft it is readily seen that does not hold, i.e., wi : ^! 4- w 1:A f_i =^ e A w/u?rc> 



because the speech component $\mathbf{y}^s_{1:M-1}[k]$ in the input to the adaptive filter $\mathbf{w}_{1:M-1}$ does not contain the
estimated speech signal $y_0^s[k-\Delta]$.

If $\mu = 1$, the classical MMSE criterion (cfr. (17)) is obtained.

Compared to the optimization criterion (6) of the GSC, a regularization term

$$\frac{1}{\mu} E\{ |\mathbf{w}^H_{1:M-1}\mathbf{y}^s_{1:M-1}[k]|^2 \} \tag{47}$$

has been added. This regularization term limits the amount of speech distortion that is caused by the filter
$\mathbf{w}_{1:M-1}$ when speech leaks into the noise references, i.e., $y_i^s[k] \neq 0$, $i = 1, \ldots, M-1$. In the sequel, we
therefore refer to the SP-SDW-MWF with $L_0 = 0$ as the Speech Distortion Regularized GSC (SDR-GSC). The
smaller $\mu$, the smaller the resulting amount of speech distortion will be. For $\mu = 0$, the output of the fixed
beamformer $A(z)$, delayed by $\Delta$ samples, is obtained. For $\mu = \infty$, all emphasis is put on noise reduction and
speech distortion is not taken into account. This corresponds to the GSC. Hence, the SDR-GSC encompasses
the GSC as a special case.

The regularization term (47) with $1/\mu \neq 0$ adds robustness to the GSC, while
not affecting the noise reduction performance in the absence of speech leakage.

• In the absence of speech leakage, i.e., $y_i^s[k] = 0$, $i = 1, \ldots, M-1$, the regularization term equals 0
for all $\mathbf{w}_{1:M-1}$ and hence the residual noise energy $\varepsilon_n^2$ is effectively minimized. In other words, in the
absence of speech leakage, the GSC solution is obtained.

• In the presence of speech leakage, i.e., $y_i^s[k] \neq 0$, $i = 1, \ldots, M-1$, speech distortion is taken
into account in the optimization criterion (44) for the adaptive filter $\mathbf{w}_{1:M-1}$, so that speech distortion
is limited while noise is reduced. The larger the amount of speech leakage, the more attention is paid to
speech distortion.

To limit speech distortion alternatively, a QIC is often imposed on the filter $\mathbf{w}_{1:M-1}$ (see Section 1.2).
In contrast to the SDR-GSC, the QIC acts irrespective of the amount of speech leakage $y_i^s[k]$ that is
present. The constraint value $\beta^2$ in (11) has to be chosen based on the largest model errors that may
occur. As a consequence, noise reduction performance is compromised even when no or very small
model errors are present. Hence, the QIC is more conservative than the SDR-GSC. The experimental
results in Section 3.4 confirm this.
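The effect of the regularization term can be made concrete with the smallest possible case. The Python sketch below assumes a single noise reference and a single real filter tap, so that the minimizer of criterion (44) becomes a scalar; the function names and the numerical values are hypothetical and only serve to illustrate the trade-off governed by $\mu$.

```python
import math


def sdr_gsc_scalar(noise_xcorr, noise_power, leak_power, mu):
    """One-tap, one-reference weight minimising
    (1/mu) * leak_power * w**2 + E{(y0n - w * y1n)**2}.

    noise_xcorr : E{y1n * y0n}, noise cross-correlation with the speech reference
    noise_power : E{y1n**2}, noise power in the noise reference
    leak_power  : E{y1s**2}, power of the speech leaking into the noise reference
    """
    if mu == 0:
        return 0.0  # no adaptation: output equals the fixed beamformer output
    inv_mu = 0.0 if math.isinf(mu) else 1.0 / mu
    return noise_xcorr / (noise_power + inv_mu * leak_power)


def distortion_energy(w, leak_power):
    """Speech distortion energy w**2 * E{y1s**2} caused by the leakage."""
    return w * w * leak_power


w_gsc = sdr_gsc_scalar(0.8, 1.0, 0.5, float("inf"))  # GSC: leakage ignored
w_reg = sdr_gsc_scalar(0.8, 1.0, 0.5, 1.0)           # regularized weight
w_no_leak = sdr_gsc_scalar(0.8, 1.0, 0.0, 0.5)       # no leakage: GSC weight
```

In this toy setting the regularized weight is smaller than the GSC weight whenever speech leaks into the noise reference, and it coincides with the GSC weight when the leakage power is zero, in line with the two cases listed above.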



3.3 Second embodiment: SP-SDW-MWF with filter $\mathbf{w}_0$

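Since the SDW-MWF (36) takes speech distortion explicitly into account in its optimization criterion, an
additional filter $\mathbf{w}_0$ on the speech reference $y_0[k]$ may be added. The SDW-MWF (36) then solves the
following more general optimization criterion

$$\mathbf{w}_{0:M-1} = \arg\min_{\mathbf{w}_{0:M-1}} \; \frac{1}{\mu} \underbrace{E\left\{ \left| \begin{bmatrix} \mathbf{w}_0^H & \mathbf{w}^H_{1:M-1} \end{bmatrix} \begin{bmatrix} \mathbf{y}_0^s[k] \\ \mathbf{y}^s_{1:M-1}[k] \end{bmatrix} \right|^2 \right\}}_{\varepsilon_d^2} + \underbrace{E\left\{ \left| y_0^n[k-\Delta] - \begin{bmatrix} \mathbf{w}_0^H & \mathbf{w}^H_{1:M-1} \end{bmatrix} \begin{bmatrix} \mathbf{y}_0^n[k] \\ \mathbf{y}^n_{1:M-1}[k] \end{bmatrix} \right|^2 \right\}}_{\varepsilon_n^2}, \tag{48}$$

where $\mathbf{w}^H_{0:M-1} = [\mathbf{w}_0^H \;\; \mathbf{w}^H_{1:M-1}]$ is given by (36).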

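Again, $\mu$ trades off speech distortion and noise reduction. For $\mu = \infty$, speech distortion $\varepsilon_d^2$ is completely
ignored so that the solution becomes

$$\mathbf{w}^H_{0:M-1} = \begin{bmatrix} \underbrace{0 \; \cdots \; 0}_{\Delta} & 1 & 0 \; \cdots \; 0 & \mathbf{0}^H \end{bmatrix}, \tag{49}$$

i.e., $\mathbf{w}_0$ is a pure delay of $\Delta$ samples and $\mathbf{w}_{1:M-1} = \mathbf{0}$, which results in a zero output signal. For
$\mu = 0$, all attention is paid to speech distortion so that the output of the fixed beamformer, delayed by $\Delta$
samples, is obtained.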

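• In the absence of speech leakage, i.e., $y_i^s[k] = 0$ for $i = 1, \ldots, M-1$, and for infinitely long filters
$\mathbf{w}_i$, $i = 0, \ldots, M-1$, the SP-SDW-MWF with $\mathbf{w}_0$ corresponds to the cascade of a SDR-GSC and a
SDW Single-channel WF (SDW-SWF) postfilter [30, 35].

Proof: In case of infinite filter lengths, the SP-SDW-MWF $\mathbf{W}_{0:M-1}(f)$ and its optimization
criterion can be represented in the frequency-domain:

$$\mathbf{W}_{0:M-1}(f) = \arg\min_{\mathbf{W}_{0:M-1}} \; E\left\{ \left| \begin{bmatrix} \left(e^{-j2\pi f\Delta} - W_0^*(f)\right) & -\mathbf{W}^H_{1:M-1}(f) \end{bmatrix} \begin{bmatrix} Y_0^n(f) \\ \mathbf{Y}^n_{1:M-1}(f) \end{bmatrix} \right|^2 \right\} + \frac{1}{\mu} E\left\{ \left| \begin{bmatrix} W_0^*(f) & \mathbf{W}^H_{1:M-1}(f) \end{bmatrix} \begin{bmatrix} Y_0^s(f) \\ \mathbf{Y}^s_{1:M-1}(f) \end{bmatrix} \right|^2 \right\}. \tag{50}$$

Without loss of generality, we assume, for reasons of simplicity, $\Delta = 0$.
Decompose $\mathbf{W}_{1:M-1}(f)$ as

$$\mathbf{W}_{1:M-1}(f) = \left(1 - W_0(f)\right) \mathbf{W}_{d,1:M-1}(f) \tag{51}$$

with $W_0(f)$ a single-channel and $\mathbf{W}_{d,1:M-1}(f)$ a multi-channel filter, and define an intermediate
output $V(f)$ (see also Figure 4) as

$$V(f) = Y_0(f) - \mathbf{W}^H_{d,1:M-1}(f)\,\mathbf{Y}_{1:M-1}(f). \tag{52}$$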
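Then, the cost function $J(W_0, \mathbf{W}_{d,1:M-1})$ of (50) can be re-written as

$$J = E\{ |(1 - W_0^*(f))\, V^n(f)|^2 \} + \frac{1}{\mu} E\{ |W_0^*(f)\, V^s(f) + \mathbf{W}^H_{d,1:M-1}(f)\,\mathbf{Y}^s_{1:M-1}(f)|^2 \}. \tag{53}$$

From $\frac{\partial}{\partial W_0} J(W_0, \mathbf{W}_{d,1:M-1}) = 0$, we find

$$W_0(f) = \left( E\{V^n V^{n,*}\} + \frac{1}{\mu} E\{V^s V^{s,*}\} \right)^{-1} \left( E\{V^n V^{n,*}\} - \frac{1}{\mu} E\{V^s\, \mathbf{Y}^{s,H}_{1:M-1} \mathbf{W}_{d,1:M-1}\} \right). \tag{54}$$

This single-channel filter $W_0(f)$ consists of two terms.

- The first term

$$W_{0,1}(f) = \left( E\{V^n V^{n,*}\} + \frac{1}{\mu} E\{V^s V^{s,*}\} \right)^{-1} E\{V^n V^{n,*}\} \tag{55}$$

estimates the noise component $V^n(f)$ in the intermediate output $V(f)$. The filter $1 - W_{0,1}$ corresponds
to a SDW Single-channel Wiener Filter (SDW-SWF) that estimates the speech component $V^s(f)$.

- The second term

$$W_{0,2}(f) = \left( E\{V^n V^{n,*}\} + \frac{1}{\mu} E\{V^s V^{s,*}\} \right)^{-1} \left( -\frac{1}{\mu} E\{V^s\, \mathbf{Y}^{s,H}_{1:M-1} \mathbf{W}_{d,1:M-1}\} \right) \tag{56}$$

estimates the speech leakage filtered by $\mathbf{W}_{d,1:M-1}(f)$, i.e., $-\mathbf{W}^H_{d,1:M-1}\mathbf{Y}^s_{1:M-1}$. The speech
component in the intermediate output $V(f)$ equals $V^s(f) = Y_0^s - \mathbf{W}^H_{d,1:M-1}\mathbf{Y}^s_{1:M-1}$. The filter
$W_{0,2}(f)$ tries to compensate for the distortion $-\mathbf{W}^H_{d,1:M-1}\mathbf{Y}^s_{1:M-1}$ by adding an estimate of
$\mathbf{W}^H_{d,1:M-1}\mathbf{Y}^s_{1:M-1}$ to the output of the SDW-SWF.

In the absence of speech leakage (i.e., $\mathbf{Y}^s_{1:M-1} = \mathbf{0}$), the filter $W_{0,2}(f)$ equals zero and $1 - W_0(f)$
corresponds to a SDW-SWF.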



From $\frac{\partial}{\partial \mathbf{W}_{d,1:M-1}} J(W_0, \mathbf{W}_{d,1:M-1}) = 0$, we obtain the following solution for $\mathbf{W}_{d,1:M-1}(f)$:

$$\mathbf{W}_{d,1:M-1}(f) = \left( E\{\mathbf{Y}^n_{1:M-1}\mathbf{Y}^{n,H}_{1:M-1}\} + \frac{1}{\mu} E\{\mathbf{Y}^s_{1:M-1}\mathbf{Y}^{s,H}_{1:M-1}\} \right)^{-1} \left( E\{\mathbf{Y}^n_{1:M-1} Y_0^{n,*}\} - \frac{1}{\mu} E\{\mathbf{Y}^s_{1:M-1} Y_0^{s,*}\} \frac{W_0}{1 - W_0} \right). \tag{57}$$

Also the multi-channel filter $\mathbf{W}_{d,1:M-1}(f)$ consists of two terms.

- The first term corresponds to the SDR-GSC

$$\left( E\{\mathbf{Y}^n_{1:M-1}\mathbf{Y}^{n,H}_{1:M-1}\} + \frac{1}{\mu} E\{\mathbf{Y}^s_{1:M-1}\mathbf{Y}^{s,H}_{1:M-1}\} \right)^{-1} E\{\mathbf{Y}^n_{1:M-1} Y_0^{n,*}\} \tag{58}$$

and estimates the noise component $Y_0^n(f)$ at the output of the fixed beamformer.

- The second term tries to compensate for the speech distortion $-W_0^*(f) Y_0^s(f)$ caused by $W_0(f)$
by adding an estimate of $\frac{W_0^*(f)}{1 - W_0^*(f)} Y_0^s(f)$ to the output of the SDR-GSC. Note that this
corresponds to adding an estimate of $W_0^*(f) Y_0^s(f)$ to the output $Z(f)$ of the SP-SDW-MWF.

In the absence of speech leakage, $\mathbf{W}_{d,1:M-1}$ corresponds to a SDR-GSC or a GSC.
Figure 4 illustrates graphically the solution for $\mathbf{W}_{d,1:M-1}(f)$ and $W_0(f)$ for $\Delta = 0$. In the absence of
speech leakage, the filters that try to compensate for the speech distortion equal 0; hence, the
SP-SDW-MWF corresponds to a SDR-GSC (or GSC) with SDW-SWF postfilter. The SP-SDW-MWF achieves
the same or a better Signal-to-Noise Ratio (SNR) improvement than the SDR-GSC, depending on the
noise scenario. ■
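For the leakage-free case, the postfilter of the proof can be evaluated per frequency bin. The Python sketch below assumes that (55) reads $W_{0,1} = P_n / (P_n + P_s/\mu)$, with $P_s = E\{V^s V^{s,*}\}$ and $P_n = E\{V^n V^{n,*}\}$ the speech and noise power in the intermediate output; under that assumption the gain $1 - W_{0,1}$ simplifies to $P_s / (P_s + \mu P_n)$. The function names and the numbers are hypothetical.

```python
def sdw_swf_gain(speech_psd, noise_psd, mu):
    """Gain 1 - W_0(f) applied to V(f) in one bin when no speech leaks."""
    return speech_psd / (speech_psd + mu * noise_psd)


def sdw_swf_gain_from_w01(speech_psd, noise_psd, mu):
    """Same gain, computed as 1 - W_0,1 with W_0,1 in the form assumed for (55).

    Requires mu > 0 because of the division by mu.
    """
    w01 = noise_psd / (noise_psd + speech_psd / mu)
    return 1.0 - w01


wiener_gain = sdw_swf_gain(4.0, 1.0, 1.0)      # mu = 1: classical Wiener gain
aggressive_gain = sdw_swf_gain(4.0, 1.0, 2.0)  # larger mu: more noise reduction
```

For $\mu = 1$ the gain equals the classical single-channel Wiener gain $P_s / (P_s + P_n)$; for $\mu \to 0$ it tends to 1, so the postfilter leaves the speech untouched.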

3.4 Experimental results 

This Section illustrates the theoretical results of Section 3.2 and Section 3.3 by means of experimental 
results for a hearing aid application. Section 3.4.1 and Section 3.4.2, respectively, describe the set-up and 
the performance measures that are used. In Section 3.4.3, the impact of the different parameter settings of 
the SP-SDW-MWF on the performance and the sensitivity to signal model errors is evaluated. Comparison 
is made with the QIC-GSC. 

3.4.1 Set-up 

A three-microphone Behind-The-Ear (BTE) hearing aid with omnidirectional microphones (Knowles
FG-3452) has been mounted on a dummy head in an office room. The interspacing $d$ between the first and
the second microphone is about $d = 1$ cm and the interspacing between the second and third microphone
about 1.5 cm. The reverberation time $T_{60dB}$ is about 700 ms for a speech weighted noise. The desired
speech signal and the noise signals are uncorrelated. Both the speech and the noise signal have a level of
70 dB SPL at the center of the head. The desired speech source and noise sources are positioned at a distance
of 1 meter from the head: the speech source in front of the head, the noise sources at an angle $\theta$ w.r.t. the
speech source. To get an idea of the average performance based on directivity only, stationary speech and
noise signals with the same average long-term power spectral density are used. The signals can be found
on [36]. The total duration of the input signal is 10 seconds, of which 5 seconds contain noise only and 5
seconds contain both the speech and noise signal. For evaluation purposes, the speech and noise signal have
been recorded separately.

The microphone signals are pre-whitened prior to processing to improve intelligibility [37], and the 
output is accordingly de-whitened. In the experiments, the microphones have been calibrated by means of 
recordings of an anechoic speech weighted noise signal positioned at 0° measured while the microphone 
array was mounted on the head. A delay-and-sum beamformer is used as a fixed beamformer, since -in case 
of small microphone interspacing - it is robust to model errors. The blocking matrix B pairwise subtracts 
the time aligned calibrated microphone signals. 

To investigate the effect of the different parameter settings (i.e., $\mu$, $\mathbf{w}_0$) on the performance only, the
filter coefficients are computed using (36), where $E\{\mathbf{y}^s_{0:M-1}\mathbf{y}^{s,H}_{0:M-1}\}$ is estimated by means of the clean
speech contributions of the microphone signals. In practice, $E\{\mathbf{y}^s_{0:M-1}\mathbf{y}^{s,H}_{0:M-1}\}$ is approximated using (30).
The effect of approximation (30) on the performance was found to be small (i.e., differences of at most
0.5 dB in intelligibility weighted Signal-to-Noise ratio improvement) for the given data set. The QIC-GSC
is implemented using variable loading RLS [19]. The filter length $L$ per channel equals 96.

3.4.2 Performance measures 

To assess the performance of the different approaches, the broadband intelligibility weighted signal-to-noise
ratio improvement [38] is used, defined as

$$\Delta\mathrm{SNR}_{intellig} = \sum_i I_i \left( \mathrm{SNR}_{i,out} - \mathrm{SNR}_{i,in} \right), \tag{59}$$

where the band importance function $I_i$ expresses the importance of the $i$-th one-third octave band with
center frequency $f_i^c$ for intelligibility, $\mathrm{SNR}_{i,out}$ is the output SNR (in dB) and $\mathrm{SNR}_{i,in}$ is the input SNR
(in dB) in the $i$-th one-third octave band. The center frequencies $f_i^c$ and the values $I_i$ are defined in [39].
The intelligibility weighted signal-to-noise ratio reflects how much intelligibility is improved by the noise
reduction algorithms, but does not take into account speech distortion.

To measure the amount of speech distortion, we define the following intelligibility weighted spectral
distortion measure

$$\mathrm{SD}_{intellig} = \sum_i I_i\, \mathrm{SD}_i, \tag{60}$$

with $\mathrm{SD}_i$ the average spectral distortion (dB) in the $i$-th one-third octave band, measured as

$$\mathrm{SD}_i = \frac{\displaystyle\int_{2^{-1/6} f_i^c}^{2^{1/6} f_i^c} \left| 10 \log_{10} G^s(f) \right| df}{\left( 2^{1/6} - 2^{-1/6} \right) f_i^c}, \tag{61}$$

with $G^s(f)$ the power transfer function of speech from the input to the output of the noise reduction
algorithm.

To exclude the effect of the spatial pre-processor, the performance measures are calculated w.r.t. the 
output of the fixed beamformer. 
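The two performance measures are simple weighted sums over one-third octave bands and can be sketched in a few lines of Python. The sketch below is illustrative only: the three band weights are made-up numbers, not the band importance values of [39], and the integral over one band is approximated with the trapezoidal rule for a power transfer function supplied as a callable.

```python
import math


def delta_snr_intellig(weights, snr_out_db, snr_in_db):
    """Equation (59): importance-weighted sum of per-band SNR gains (dB)."""
    return sum(w * (o - i) for w, o, i in zip(weights, snr_out_db, snr_in_db))


def band_spectral_distortion(power_gain, f_center, num_steps=200):
    """Equation (61): mean of |10 log10 G(f)| over one one-third octave band,
    approximated with the trapezoidal rule."""
    f_lo = f_center * 2.0 ** (-1.0 / 6.0)
    f_hi = f_center * 2.0 ** (1.0 / 6.0)
    step = (f_hi - f_lo) / num_steps
    values = [abs(10.0 * math.log10(power_gain(f_lo + n * step)))
              for n in range(num_steps + 1)]
    integral = step * (sum(values) - 0.5 * (values[0] + values[-1]))
    return integral / (f_hi - f_lo)


def sd_intellig(weights, sd_db):
    """Equation (60): importance-weighted sum of per-band distortions (dB)."""
    return sum(w * sd for w, sd in zip(weights, sd_db))


# Hypothetical three-band example (weights are NOT the values of [39]).
weights = [0.2, 0.5, 0.3]
gain = delta_snr_intellig(weights, [5.0, 8.0, 3.0], [0.0, 2.0, 1.0])
sd_band = band_spectral_distortion(lambda f: 0.5, 1000.0)
sd_total = sd_intellig(weights, [sd_band, 0.0, 0.0])
```

A flat power transfer function of 0.5 gives a band distortion of about 3 dB, since $|10 \log_{10} 0.5| \approx 3.01$.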

3.4.3 Experimental results 

The impact of the different parameter settings for $\mu$ and $\mathbf{w}_0$ on the performance of the SP-SDW-MWF is
illustrated for a five noise source scenario. The five noise sources are positioned at angles 75°, 120°, 180°, 240°,
285° w.r.t. the desired source at 0°. To assess the sensitivity of the algorithm against errors in the assumed
signal model, the influence of microphone mismatch, e.g., gain mismatch of the second microphone, on
the performance is depicted. Among the different possible signal model errors, microphone mismatch was
found to be especially harmful to the performance of the GSC in a hearing aid application [7]. In hearing
aids, microphones are rarely matched in gain and phase. In [3], gain and phase differences between
microphone characteristics of up to 6 dB and 10°, respectively, have been reported.

SP-SDW-MWF without $\mathbf{w}_0$ (SDR-GSC)

Figure 5 plots the improvement $\Delta\mathrm{SNR}_{intellig}$ and the speech distortion $\mathrm{SD}_{intellig}$ as a function of $1/\mu$ obtained
by the SDR-GSC (i.e., the SP-SDW-MWF without filter $\mathbf{w}_0$) for different gain mismatches $\Upsilon_2$ at the second
microphone. In the absence of microphone mismatch, the amount of speech leakage into the noise references
is limited. Hence, the amount of speech distortion is low for all $\mu$. Since there is still a small amount of
speech leakage due to reverberation, the amount of noise reduction and speech distortion slightly decreases
for increasing $1/\mu$, especially for $1/\mu > 1$. In the presence of microphone mismatch, the amount of speech
leakage into the noise references grows. For $1/\mu = 0$ (GSC), the speech gets significantly distorted. Due to
the cancellation of the desired signal, the improvement $\Delta\mathrm{SNR}_{intellig}$ also degrades. Setting $1/\mu > 0$ improves
the performance of the GSC in the presence of model errors without compromising performance in the
absence of signal model errors.

SP-SDW-MWF with filter $\mathbf{w}_0$

Figure 6 plots the performance measures $\Delta\mathrm{SNR}_{intellig}$ and $\mathrm{SD}_{intellig}$ of the SP-SDW-MWF with filter $\mathbf{w}_0$.
In general, the amount of speech distortion and noise reduction grows for decreasing $1/\mu$. For $\mu = \infty$,
all attention is paid to noise reduction. As also illustrated by Figure 6, this results in a total cancellation
of the speech and the noise signal and hence degraded performance. In the absence of model errors, the
settings $L_0 = 0$ and $L_0 \neq 0$ result, except for $1/\mu = 0$, in the same $\Delta\mathrm{SNR}_{intellig}$$^7$, while the distortion
for the SP-SDW-MWF with $\mathbf{w}_0$ is higher due to the additional single-channel SDW-MWF. For $L_0 \neq 0$, the
performance does not, in contrast to $L_0 = 0$, degrade due to the microphone mismatch.

Comparison with QIC 

Figure 7 depicts the improvement $\Delta\mathrm{SNR}_{intellig}$ and the speech distortion $\mathrm{SD}_{intellig}$, respectively, of the
QIC-GSC as a function of $\beta^2$. Like the SDR-GSC, the QIC increases the robustness of the GSC. The QIC is
independent of the amount of speech leakage. As a consequence, distortion grows fast with increasing gain
deviation. The constraint value $\beta^2$ should be chosen so that the maximum permissible speech distortion level
is not exceeded for the largest possible model errors. This comes at the expense of reduced noise reduction for
small model errors. The SDR-GSC, on the other hand, keeps the speech distortion limited for all model errors
(see Figure 5). Attention towards speech distortion is increased if the amount of speech leakage grows. As a
result, a better noise reduction performance is obtained for small model errors, while guaranteeing sufficient
robustness for large model errors. In addition, Figure 6 demonstrates that an additional filter $\mathbf{w}_0$ significantly
improves the performance of the SP-SDW-MWF in the presence of signal model errors.

3.5 Conclusion 

In the present invention, we established a generalized noise reduction scheme, referred to as Spatially
pre-processed, Speech Distortion Weighted Multi-channel Wiener filter (SP-SDW-MWF), that consists of a fixed,
spatial pre-processor and an adaptive stage that is based on a SDW-MWF. The new scheme encompasses the
GSC and MWF as special cases. In addition, it allows for an in-between solution that can be interpreted as a
Speech Distortion Regularized GSC. Depending on the setting of a trade-off parameter $\mu$ and the presence
or absence of the filter $\mathbf{w}_0$ on the speech reference, the GSC, the SDR-GSC or a (SDW-)MWF is obtained.

In Section 3.2 and Section 3.3, the different parameter settings of the SP-SDW-MWF have been inter- 
preted. 

• Without $\mathbf{w}_0$, the SP-SDW-MWF corresponds to a SDR-GSC: the ANC design criterion is supplemented
with a regularization term that limits the speech distortion due to signal model errors. The
larger $1/\mu$, the smaller the amount of distortion. For $1/\mu = 0$, distortion is ignored completely, which
corresponds to the GSC-solution. The SDR-GSC is then an alternative technique to the QIC-GSC to
decrease the sensitivity of the GSC to signal model errors. In contrast to the QIC-GSC, the SDR-GSC
shifts emphasis towards speech distortion when the amount of speech leakage grows. In the absence
of signal model errors, the performance of the GSC is preserved. As a result, a better noise reduction
performance is obtained for small model errors, while guaranteeing robustness against large model
errors.

• Since the SP-SDW-MWF takes speech distortion explicitly into account, a filter $\mathbf{w}_0$ on the speech
reference can be added. It is shown that, in the absence of speech leakage and for infinitely long filter
lengths, the SP-SDW-MWF corresponds to a cascade of a SDR-GSC with a SDW-SWF postfilter.
In the presence of speech leakage, the SP-SDW-MWF with $\mathbf{w}_0$ tries to preserve its performance:
compared to a SDR-GSC with SDW-SWF postfilter, the SP-SDW-MWF then contains extra filtering
operations that compensate for the performance degradation of the SDR-GSC with SDW-SWF due to
speech leakage. In contrast to the SDR-GSC (and thus also the GSC), performance does not degrade
due to microphone mismatch.

7 For $L_0 \neq 0$, the SNR improvement was larger thanks to the single channel SDW MWF postfilter (see Section 3.3). For other
noise sources, e.g., a narrow band noise source, a better improvement in $\mathrm{SNR}_{intellig}$ can also be achieved by $L_0 \neq 0$ thanks to the
single channel spectral filtering.

In Section 3.4, experimental results for a hearing aid application confirmed the theoretical results of Sec- 
tion 3.2 and Section 3.3. The SP-SDW-MWF indeed increases the robustness of the GSC against signal 
model errors. Comparison with the widely studied QIC-GSC demonstrated that the SP-SDW-MWF achieves 
a better noise reduction performance for a given maximum allowable speech distortion level. 

4 Third embodiment: Stochastic gradient implementations 

In [22, 27], recursive implementations of the MWF have been proposed based on a GSVD or QR
decomposition. A subband implementation [28] results in improved intelligibility at a significantly lower cost
compared to the fullband approach. These techniques can be extended to implement the SP-SDW-MWF.
However, in contrast to the GSC and the QIC-GSC [14], no cheap stochastic gradient based implementation
of the SP-SDW-MWF is available. In [25], an LMS based algorithm for the MWF has been developed. The
algorithm needs recordings of calibration signals. Since room acoustics, microphone characteristics and the
location of the desired speaker change over time, frequent re-calibration is required, making this approach
cumbersome and expensive. In [26], an LMS based SDW-MWF has been proposed that avoids the need for
calibration signals. The algorithm, however, relies on some independence assumptions that are not necessarily
satisfied. In the present invention, we propose time-domain and frequency-domain stochastic gradient
implementations of the SP-SDW-MWF that preserve the benefit of the matrix-based SP-SDW-MWF over the
QIC-GSC. The LMS based SDW-MWF of [26] is modified so that it applies to the SP-SDW-MWF scheme. In
addition, other stochastic gradient algorithms are developed that achieve a better performance. Experimental
results demonstrate that the proposed stochastic gradient implementation of the SP-SDW-MWF outperforms
the SPA, while its computational cost is limited.

This section is organized as follows. Starting from the cost function of the SP-SDW-MWF, a time-domain
stochastic gradient algorithm is derived in Section 4.1. Applying the independence assumptions
made in [26] results in an LMS based SP-SDW-MWF similar to [26]. To increase convergence and reduce
complexity, the stochastic gradient and LMS based algorithm are implemented in the frequency-domain.
Both the stochastic gradient and the LMS based algorithm suffer from a large excess error when applied in
highly time-varying noise scenarios. In Section 4.2, we show that the performance of the stochastic gradient
algorithm is improved by applying a low pass filter to the part of the gradient estimate that limits speech
distortion. The low pass filtering avoids a highly time-varying distortion of the desired speech component
while not degrading the tracking performance needed in time-varying noise scenarios. Section 4.3 compares
the performance of the different frequency-domain stochastic gradient algorithms. Experimental results
show that the proposed stochastic gradient algorithm preserves the benefit of the SP-SDW-MWF over the
QIC-GSC.

4.1 Stochastic gradient algorithm 
4.1.1 Derivation 

A stochastic gradient algorithm approximates the steepest descent algorithm, using an instantaneous gradient
estimate. Given the cost function (41), the steepest descent algorithm iterates as follows$^8$

$$\mathbf{w}[n+1] = \mathbf{w}[n] + \rho \left( E\{\mathbf{y}^n\, y_0^{n,*}[k-\Delta]\} - E\{\mathbf{y}^n\mathbf{y}^{n,H}[k]\}\,\mathbf{w}[n] - \frac{1}{\mu} E\{\mathbf{y}^s\mathbf{y}^{s,H}[k]\}\,\mathbf{w}[n] \right), \tag{62}$$

with $\mathbf{w}[k], \mathbf{y}[k] \in \mathbb{C}^{NL \times 1}$, where $N$ denotes the number of input channels to the adaptive filter and $L$ the
number of filter taps per channel. Replacing the iteration index $n$ by a time index $k$ and leaving out the
expectation values $E\{\cdot\}$, we obtain the following update equation

$$\mathbf{w}[k+1] = \mathbf{w}[k] + \rho \Big\{ \mathbf{y}^n[k] \left( y_0^{n,*}[k-\Delta] - \mathbf{y}^{n,H}[k]\,\mathbf{w}[k] \right) - \underbrace{\frac{1}{\mu}\, \mathbf{y}^s\mathbf{y}^{s,H}[k]\,\mathbf{w}[k]}_{\mathbf{r}[k]} \Big\}. \tag{63}$$

For $1/\mu = 0$ and no filtering $\mathbf{w}_0$ on the speech reference, equation (63) reduces to the update formula used in
the GSC during periods of noise only (i.e., when $y_i[k] = y_i^n[k]$, $i = 0, \ldots, M-1$). The additional term $\mathbf{r}[k]$ in
the gradient estimate limits the speech distortion due to possible signal model errors.

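Equation (63) requires knowledge of the correlation matrix $\mathbf{y}^s\mathbf{y}^{s,H}[k]$ or $E\{\mathbf{y}^s\mathbf{y}^{s,H}[k]\}$ of the clean
speech. In practice, this information is not available. To avoid the need for calibration, speech + noise
signal vectors $\mathbf{y}_{buf_1}$ are stored into a circular buffer $B_1 \in \mathbb{R}^{N \times L_{buf_1}}$ during processing as in [26]. During
periods of noise only (i.e., when $y_i[k] = y_i^n[k]$, $i = 0, \ldots, M-1$), the filter $\mathbf{w}$ is updated using the following
approximation of the term $\mathbf{r}[k] = \frac{1}{\mu}\mathbf{y}^s\mathbf{y}^{s,H}[k]\,\mathbf{w}[k]$ in (63)

$$\frac{1}{\mu}\, \mathbf{y}^s\mathbf{y}^{s,H}[k]\,\mathbf{w}[k] \approx \frac{1}{\mu} \left( \mathbf{y}_{buf_1}\mathbf{y}^H_{buf_1}[k] - \mathbf{y}\mathbf{y}^H[k] \right) \mathbf{w}[k]. \tag{64}$$

This results in the update formula

$$\mathbf{w}[k+1] = \mathbf{w}[k] + \rho \left\{ \mathbf{y}[k] \left( y_0^*[k-\Delta] - \mathbf{y}^H[k]\,\mathbf{w}[k] \right) - \frac{1}{\mu} \left( \mathbf{y}_{buf_1}\mathbf{y}^H_{buf_1}[k] - \mathbf{y}\mathbf{y}^H[k] \right) \mathbf{w}[k] \right\} \tag{65}$$

during periods of noise only. In the sequel, a normalized step size $\rho$ is used, i.e.,

$$\rho = \frac{\rho'}{\frac{1}{\mu} \left| \mathbf{y}^H_{buf_1}\mathbf{y}_{buf_1}[k] - \mathbf{y}^H\mathbf{y}[k] \right| + \mathbf{y}^H\mathbf{y}[k] + \delta}, \tag{66}$$

where $\delta$ is a very small constant. The absolute value $\left| \mathbf{y}^H_{buf_1}\mathbf{y}_{buf_1} - \mathbf{y}^H\mathbf{y} \right|$ has been inserted to guarantee
a positive valued estimate of the clean speech energy $\mathbf{y}^{s,H}\mathbf{y}^s[k]$. Additional storage of noise-only vectors
$\mathbf{y}_{buf_2} \in \mathbb{C}^{ML \times 1}$ in a second buffer $B_2 \in \mathbb{R}^{N \times L_{buf_2}}$ allows $\mathbf{w}$ to be adapted also during periods of speech + noise,
using

$$\mathbf{w}[k+1] = \mathbf{w}[k] + \rho \left\{ \mathbf{y}_{buf_2}[k] \left( y_{0,buf_2}^*[k-\Delta] - \mathbf{y}^H_{buf_2}[k]\,\mathbf{w}[k] \right) + \frac{1}{\mu} \left( \mathbf{y}_{buf_2}\mathbf{y}^H_{buf_2}[k] - \mathbf{y}\mathbf{y}^H[k] \right) \mathbf{w}[k] \right\} \tag{67}$$

with

$$\rho = \frac{\rho'}{\frac{1}{\mu} \left| \mathbf{y}^H\mathbf{y}[k] - \mathbf{y}^H_{buf_2}\mathbf{y}_{buf_2}[k] \right| + \mathbf{y}^H_{buf_2}\mathbf{y}_{buf_2}[k] + \delta}. \tag{68}$$

In the sequel, we will, for reasons of conciseness, only consider the update procedure of the time-domain
stochastic gradient algorithms during noise only, hence $\mathbf{y}[k] = \mathbf{y}^n[k]$. The extension towards updating
during speech + noise periods with the use of a second, noise-only buffer $B_2$ is straightforward: the equations
are found by replacing the noise-only input vectors $\mathbf{y}[k]$ by $\mathbf{y}_{buf_2}[k]$ and the speech + noise vectors $\mathbf{y}_{buf_1}[k]$
by the input speech + noise vector $\mathbf{y}[k]$.

8 In the sequel, the subscripts $0:M-1$ in the adaptive filter $\mathbf{w}_{0:M-1}$ and the input vector $\mathbf{y}_{0:M-1}$ are omitted for the sake of
conciseness.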
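The noise-only update can be sketched in plain Python for real-valued signals. The sketch assumes that update (65) adds $\rho\{\mathbf{y}(y_0[k-\Delta] - \mathbf{y}^T\mathbf{w}) - \frac{1}{\mu}(\mathbf{y}_{buf_1}\mathbf{y}_{buf_1}^T - \mathbf{y}\mathbf{y}^T)\mathbf{w}\}$ to the filter and that the normalized step size (66) divides a constant $\rho'$ by $\frac{1}{\mu}|\mathbf{y}_{buf_1}^T\mathbf{y}_{buf_1} - \mathbf{y}^T\mathbf{y}| + \mathbf{y}^T\mathbf{y} + \delta$. The function name, the default values of `rho_prime` and `delta`, and the toy vectors are hypothetical.

```python
def sg_update_noise_only(w, y, y0_delayed, y_buf1, mu, rho_prime=0.5, delta=1e-8):
    """One noise-only stochastic gradient update for real-valued signals.

    w          : current filter, list of N*L floats
    y          : current noise-only input vector (same length as w)
    y0_delayed : delayed speech-reference sample y0[k - Delta]
    y_buf1     : speech + noise vector read from the circular buffer B1
    """
    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))

    inv_mu = 1.0 / mu
    yy, bb = dot(y, y), dot(y_buf1, y_buf1)
    # Normalised step size: |bb - yy| estimates the clean speech energy.
    rho = rho_prime / (inv_mu * abs(bb - yy) + yy + delta)
    yw, bw = dot(y, w), dot(y_buf1, w)
    err = y0_delayed - yw  # y0[k - Delta] - y^T[k] w[k]
    # (y_buf1 y_buf1^T - y y^T) w is computed without forming the matrices.
    return [
        wi + rho * (yi * err - inv_mu * (bi * bw - yi * yw))
        for wi, yi, bi in zip(w, y, y_buf1)
    ]


# Buffer vector equal to the noise vector: the regularization term vanishes
# and the update is a plain NLMS step towards y0_delayed.
w_nlms = sg_update_noise_only([0.0, 0.0], [1.0, 0.0], 2.0, [1.0, 0.0], mu=1.0)
# No noise input but speech in the buffer: the filter is pulled towards zero.
w_shrunk = sg_update_noise_only([1.0, 0.0], [0.0, 0.0], 0.0, [2.0, 0.0], mu=1.0)
```

Computing $\mathbf{y}_{buf_1}(\mathbf{y}_{buf_1}^T\mathbf{w})$ and $\mathbf{y}(\mathbf{y}^T\mathbf{w})$ as vector-times-scalar products keeps the cost linear in the filter length, which is the reason a stochastic gradient update is attractive compared to matrix-based implementations.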
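Using$^9$

$$\mathbf{w}_{opt} = \left( \frac{1}{\mu} E\{\mathbf{y}_{buf_1}\mathbf{y}^H_{buf_1}\} + \left(1 - \frac{1}{\mu}\right) E\{\mathbf{y}\mathbf{y}^H\} \right)^{-1} E\{\mathbf{y}\, y_0^*[k-\Delta]\}, \tag{69}$$

where $\mathbf{y}$ is a noise-only vector, and (65), it can be shown that

$$E\{\mathbf{w}[k+1] - \mathbf{w}_{opt}\} = \left( \mathbf{I} - \rho\, E\left\{ \frac{1}{\mu}\mathbf{y}_{buf_1}\mathbf{y}^H_{buf_1} + \left(1 - \frac{1}{\mu}\right)\mathbf{y}\mathbf{y}^H \right\} \right)^{k+1} E\{\mathbf{w}[0] - \mathbf{w}_{opt}\}. \tag{70}$$

Hence, the algorithm (65)-(67) is convergent in the mean provided that the step size $\rho$ is smaller than
$2/\lambda_{max}$, with $\lambda_{max}$ the maximum eigenvalue of $E\{\frac{1}{\mu}\mathbf{y}_{buf_1}\mathbf{y}^H_{buf_1} + (1 - \frac{1}{\mu})\mathbf{y}\mathbf{y}^H\}$. The similarity of (65) with standard
NLMS lets us presume that setting $\rho < 2 / \sum_{i=1}^{NL} \lambda_i$, with $\lambda_i$, $i = 1, \ldots, NL$, the eigenvalues of
$E\{\frac{1}{\mu}\mathbf{y}_{buf_1}\mathbf{y}^H_{buf_1} + (1 - \frac{1}{\mu})\mathbf{y}\mathbf{y}^H\} \in \mathbb{R}^{NL \times NL}$, or, in case of FIR filters, setting

$$\rho < \frac{2}{\frac{1}{\mu}\, \mathbf{y}^H_{buf_1}\mathbf{y}_{buf_1}[k] + \left(1 - \frac{1}{\mu}\right) \mathbf{y}^H\mathbf{y}[k]} \tag{71}$$

guarantees convergence in the mean square. Equation (71) explains the normalization (66) and (68) for the
step size $\rho$.

9 When the second order statistics of the noise are short-term stationary, $\mathbf{w}_{opt}$ equals (36).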

However, since generally

$$\mathbf{y}\mathbf{y}^H[k] \neq \mathbf{y}^n_{buf_1}\mathbf{y}^{n,H}_{buf_1}[k], \tag{72}$$

the instantaneous gradient estimate in (65) is, compared to (63), additionally perturbed by

$$\frac{1}{\mu} \left( \mathbf{y}\mathbf{y}^H[k] - \mathbf{y}^n_{buf_1}\mathbf{y}^{n,H}_{buf_1}[k] \right) \mathbf{w}[k], \tag{73}$$

for $\mu \neq \infty$. Hence, for $\mu \neq \infty$, the update equation (65)-(67) suffers from a larger residual excess error
than (63). The additional excess error grows for decreasing $\mu$, increasing step size $\rho$ and increasing vector
length $LN$ of the vector $\mathbf{y}$, with $L$ the filter length per channel and $N$ the number of inputs to the adaptive
filter. It is expected to be especially large for highly time-varying noise, e.g., multi-talker babble noise.

4.1.2 NLMS based algorithm 

In [26], an LMS based implementation of the SDW-MWF has been proposed. Besides (64), some additional, 
independence assumptions are made. Applying these assumptions to (65)-(67), results in an LMS based 
implementation of the SP-SDW-MWF similar to [26]. Assuming that 



y^^ytu/xWySffc-A] = 0 (74) 



+ywJ%*I*l) - o, (75) 

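hold, with $k$ and $l$ different time instants, (65) can be simplified to

$$\mathbf{w}[k+1] = \mathbf{w}[k] + \frac{\rho'}{\mathbf{x}^H[k]\mathbf{x}[k] + \delta}\, \mathbf{x}[k] \left( d^*[k] - \mathbf{x}^H[k]\,\mathbf{w}[k] \right), \tag{76}$$

where

$$d[k] = \frac{y_0[k-\Delta]}{\sqrt{1 - \frac{1}{\mu}}}; \quad \mathbf{x}[k] = \sqrt{1 - \frac{1}{\mu}}\; \mathbf{y}[k] + \sqrt{\frac{1}{\mu}}\; \mathbf{y}_{buf_1}[k] \tag{77}$$

during periods of noise only (i.e., $\mathbf{y}[k] = \mathbf{y}^n[k]$). During speech + noise (i.e., $\mathbf{y}[k] = \mathbf{y}^s[k] + \mathbf{y}^n[k]$), $d[k]$
and $\mathbf{x}[k]$ in (76) are set to

$$d[k] = \frac{y_{0,buf_2}[k-\Delta]}{\sqrt{1 - \frac{1}{\mu}}}; \quad \mathbf{x}[k] = \sqrt{1 - \frac{1}{\mu}}\; \mathbf{y}_{buf_2}[k] + \sqrt{\frac{1}{\mu}}\; \mathbf{y}[k]. \tag{78}$$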

Equations (74) and (75) assume that, besides speech and noise vectors, noise vectors at different
time instants are also mutually uncorrelated. In practice, (74) and (75) do not hold, especially for large
$\sqrt{\frac{1/\mu}{1 - 1/\mu}}$ and $\sqrt{\frac{1}{\mu}\left(1 - \frac{1}{\mu}\right)}$, i.e., for $\mu \to 1$. Hence, compared to (65)-(67), performance is expected to be worse.
In addition, equations (76)-(78) cannot, in contrast to (65), be applied for $\mu < 1$. Compared to (65), no
significant complexity reduction is achieved. The LMS based updating (76) requires $4NL + 3$ Multiply-Accumulate
(MAC) operations per sample$^{10}$, whereas update formula (65) requires $4NL + 5$ MAC per sample. The
computation of the normalized step size in (76) requires $NL + 2$ fewer MAC per sample than in (65).
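The role of the independence assumptions can be made explicit by expanding the products that appear in the LMS based update. The short derivation below is a sketch, not a reproduction of (74) and (75): it assumes $1/\mu < 1$, real-valued scaling factors, and the noise-only definitions $d[k] = y_0[k-\Delta] / \sqrt{1 - 1/\mu}$ and $\mathbf{x}[k] = \sqrt{1 - 1/\mu}\,\mathbf{y}[k] + \sqrt{1/\mu}\,\mathbf{y}_{buf_1}[k]$.

```latex
\begin{align*}
\mathbf{x}[k]\, d^*[k]
  &= \mathbf{y}[k]\, y_0^*[k-\Delta]
   + \sqrt{\tfrac{1/\mu}{1 - 1/\mu}}\; \mathbf{y}_{buf_1}[k]\, y_0^*[k-\Delta], \\
\mathbf{x}[k]\, \mathbf{x}^H[k]
  &= \left(1 - \tfrac{1}{\mu}\right) \mathbf{y}\mathbf{y}^H[k]
   + \tfrac{1}{\mu}\, \mathbf{y}_{buf_1}\mathbf{y}_{buf_1}^H[k]
   + \sqrt{\tfrac{1}{\mu}\left(1 - \tfrac{1}{\mu}\right)}
     \left( \mathbf{y}[k]\,\mathbf{y}_{buf_1}^H[k] + \mathbf{y}_{buf_1}[k]\,\mathbf{y}^H[k] \right).
\end{align*}
```

If the two cross terms (the last term on each line) are neglected, $\mathbf{x}[k]\left(d^*[k] - \mathbf{x}^H[k]\mathbf{w}[k]\right)$ equals $\mathbf{y}[k]\left(y_0^*[k-\Delta] - \mathbf{y}^H[k]\mathbf{w}[k]\right) - \frac{1}{\mu}\left(\mathbf{y}_{buf_1}\mathbf{y}_{buf_1}^H[k] - \mathbf{y}\mathbf{y}^H[k]\right)\mathbf{w}[k]$, which is the gradient estimate of the stochastic gradient update. The LMS based algorithm therefore differs from the stochastic gradient algorithm only through these cross terms and through the normalization of the step size.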

4.1.3 Frequency-domain implementation

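As stated before, the stochastic gradient algorithms (65)-(67) and (76) are expected to suffer from a large
excess error for large $\rho'/\mu$ and/or highly time-varying noise, due to a large difference between the rank-one
noise correlation matrices $\mathbf{y}^n\mathbf{y}^{n,H}[k]$ measured at different time instants $k$. The gradient estimate can be
improved by replacing

$$\mathbf{y}_{buf_1}[k]\mathbf{y}_{buf_1}^H[k] - \mathbf{y}[k]\mathbf{y}^H[k] \qquad (79)$$

in (65) with the time-average

$$\frac{1}{K}\sum_{l=k-K+1}^{k}\mathbf{y}_{buf_1}[l]\mathbf{y}_{buf_1}^H[l] - \frac{1}{K}\sum_{l=k-K+1}^{k}\mathbf{y}[l]\mathbf{y}^H[l], \qquad (80)$$

where $\frac{1}{K}\sum_{l=k-K+1}^{k}\mathbf{y}_{buf_1}[l]\mathbf{y}_{buf_1}^H[l]$ is updated during periods of speech + noise and $\frac{1}{K}\sum_{l=k-K+1}^{k}\mathbf{y}[l]\mathbf{y}^H[l]$
during periods of noise only. However, this would require expensive matrix operations. A block-based
implementation intrinsically performs this averaging:

$$\mathbf{w}[(k+1)K] = \mathbf{w}[kK] + \frac{\rho}{K}\left[\sum_{i=0}^{K-1}\mathbf{y}[kK+i]\left(y_0^*[kK+i-\Delta] - \mathbf{y}^H[kK+i]\mathbf{w}[kK]\right) - \frac{1}{\mu}\sum_{i=0}^{K-1}\left(\mathbf{y}_{buf_1}[kK+i]\mathbf{y}_{buf_1}^H[kK+i] - \mathbf{y}[kK+i]\mathbf{y}^H[kK+i]\right)\mathbf{w}[kK]\right] \qquad (81)$$

10 Note that the output $y_0[k-\Delta] - \mathbf{w}^H\mathbf{y}[k]$ of the algorithm still has to be computed.

The gradient, and hence also $\mathbf{y}_{buf_1}\mathbf{y}_{buf_1}^H[k] - \mathbf{y}\mathbf{y}^H[k]$, is averaged over $K$ iterations before adjustments are made
to $\mathbf{w}$. This comes at the expense of a convergence rate that is reduced by a factor $K$.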
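
The block-wise averaging can be sketched as below. This is an illustrative sketch under stated assumptions: the weights are frozen over the block, the accumulated gradient is scaled by $\rho/K$, and the speech distortion term uses the difference between buffered speech + noise and current noise-only outer products. The function name and argument layout are hypothetical.

```python
import numpy as np

def block_sg_update(w, Y, Y_buf1, y0, mu, rho):
    """Average the stochastic gradient over a block of K vectors, then update w once.

    Y, Y_buf1: (K, NL) arrays holding noise-only and buffered speech + noise vectors.
    y0: (K,) array with the delayed speech reference samples.
    """
    K = Y.shape[0]
    grad = np.zeros_like(w)
    for i in range(K):
        y, yb = Y[i], Y_buf1[i]
        # noise reduction term: y (y0* - y^H w)
        grad = grad + y * (np.conj(y0[i]) - np.vdot(y, w))
        # speech distortion term: (1/mu) (yb yb^H - y y^H) w
        grad = grad - (yb * np.vdot(yb, w) - y * np.vdot(y, w)) / mu
    return w + (rho / K) * grad
```

Because `w` is held fixed inside the loop, the update is equivalent to using block-averaged correlation estimates, which is the averaging effect described above.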

The block-based implementation is computationally more efficient when it is implemented in the frequency
domain, especially for large filter lengths. In addition, in a frequency-domain implementation, each
frequency bin gets its own step size, resulting in faster convergence compared to a time-domain implementation
while not degrading the time-domain MSE. Although the frequency-domain and time-domain implementations
obtain the same MSE, the improvement $\Delta\mathrm{SNR}_{intellig}$, which is determined by the excess errors in each
frequency bin, may be different. In a time-domain implementation, one common step size $\rho$ is used for the
different frequency bins. The convergence rate depends on the eigenvalue spread of the correlation matrix of
the input signals to the adaptive filter and hence on the power spectrum of the input signal. In frequency bins
with little power, this common step size will be smaller than in the frequency-domain approach, resulting in
slower convergence and less excess error in that bin. In frequency bins with large power, on the other hand,
this common step size will be larger than in the frequency-domain approach, resulting in a larger LMS
excess error in that frequency bin. Hence, in a time-domain implementation, the power spectrum of the input
signals not only determines the convergence rate but also the improvement $\Delta\mathrm{SNR}_{intellig}$. In a
frequency-domain implementation, the step size is normalized in each frequency bin, so that the different bins have
a similar convergence rate and hence also a similar excess error. Hence, the SNR improvement in each frequency
bin is more controlled (i.e., less dependent on the input power spectrum). Since signal model errors (e.g.,
microphone mismatch) modify the power spectrum of the noise references and hence the convergence rate
and improvement $\Delta\mathrm{SNR}_{intellig}$ of a time-domain implementation, frequency-domain implementations are
more appropriate to evaluate the performance of the algorithms for different signal model errors.
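
The effect of per-bin normalization can be illustrated with a deliberately simplified model, assuming that the frequency bins are decoupled and that the mean error in a bin with power $P_m$ contracts by a factor $1 - \rho P_m$ per update. The function name and both step-size parameters are hypothetical; this is a toy illustration, not a convergence analysis of the algorithms above.

```python
import numpy as np

def per_bin_convergence(power, rho_common, rho_norm):
    """Per-update contraction factor of the mean error in each bin (toy model)."""
    power = np.asarray(power, dtype=float)
    common = 1.0 - rho_common * power                # one step size for all bins
    normalized = 1.0 - (rho_norm / power) * power    # step size divided by the bin power
    return common, normalized

# A weak bin (power 0.01) and a strong bin (power 1.0)
common, normalized = per_bin_convergence([0.01, 1.0], rho_common=0.5, rho_norm=0.5)
```

With the common step size the weak bin barely converges, while normalization gives both bins the same contraction factor, in line with the qualitative argument of the paragraph above.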

Algorithm 1 and Algorithm 2 summarize a frequency-domain implementation based on overlap-save
of (65)-(67) and (76), respectively. Algorithm 1 requires $(3N + 4)$ FFTs of length $2L$ and Algorithm 2
requires $(3N + 3)$ FFTs. By storing the FFT-transformed speech + noise and noise-only vectors in the buffers (footnote 11)
$\mathbf{B}_1 \in \mathbb{C}^{N \times L_{buf_1}}$ and $\mathbf{B}_2 \in \mathbb{C}^{N \times L_{buf_2}}$, respectively, instead of storing the time-domain vectors, $N$ FFT
operations are saved. When adapting during speech + noise, the time-domain vector

$$\left[\, y_0[kL-\Delta] \;\cdots\; y_0[kL-\Delta+L-1] \,\right]^T \qquad (82)$$

should then also be stored in an additional buffer $\mathbf{B}_{2,0} \in \mathbb{R}^{1 \times L_{buf_2}/2}$ during periods of noise only, which, for
$N = M$, results in an additional storage of $L_{buf_2}/2$ words compared to when the time-domain vectors are
stored in the buffers $\mathbf{B}_1$ and $\mathbf{B}_2$.

Remark: In Algorithms 1 and 2, a common trade-off parameter $\mu$ is used in all frequency bins. Alternatively,
a different setting for $\mu$ can be used in different frequency bins. E.g., for the SP-SDW-MWF with $\mathbf{w}_0 = 0$,
$\mu$ could be set to $\infty$ at those frequencies where the GSC is sufficiently robust, e.g., for small-sized arrays at

11 Since the input signals are real, half of the FFT components are complex-conjugated. Hence, in practice only half of the
complex FFT components have to be stored in memory.

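Algorithm 1 Frequency-domain stochastic gradient SP-SDW-MWF based on overlap-save.

Initialization:

$\mathbf{W}_i[0] = [\,0 \cdots 0\,]^T$, $i = M-N, \ldots, M-1$

$P_m[0] = \delta_m$, $m = 0, \ldots, 2L-1$

Matrix definitions:

$\mathbf{g} = \begin{bmatrix} \mathbf{I}_L & \mathbf{0}_L \\ \mathbf{0}_L & \mathbf{0}_L \end{bmatrix}$; $\mathbf{k} = [\,\mathbf{0}_L \;\; \mathbf{I}_L\,]$; $\mathbf{F}$ = $2L \times 2L$ DFT matrix

For each new block of $NL$ input samples:

• If noise detected:

1. $\mathbf{F}[\,y_i[kL-L] \cdots y_i[kL+L-1]\,]^T$, $i = M-N, \ldots, M-1$ → noise buffer $\mathbf{B}_2$
$[\,y_0[kL-\Delta] \cdots y_0[kL-\Delta+L-1]\,]^T$ → noise buffer $\mathbf{B}_{2,0}$

2. $\mathbf{Y}_i^n[k] = \mathrm{diag}\{\mathbf{F}[\,y_i[kL-L] \cdots y_i[kL+L-1]\,]^T\}$, $i = M-N, \ldots, M-1$
$\mathbf{Y}_i[k] = \mathrm{diag}\{[\,\mathbf{B}_1(i,0) \cdots \mathbf{B}_1(i,2L-1)\,]^T\}$, $i = M-N, \ldots, M-1$
cyclically shift each row $i$ of $\mathbf{B}_1$ over $2L$ samples, $i = M-N, \ldots, M-1$
$\mathbf{d}[k] = [\,0 \cdots 0 \;\; y_0[kL-\Delta] \cdots y_0[kL-\Delta+L-1]\,]^T$

• If speech detected:

1. $\mathbf{F}[\,y_i[kL-L] \cdots y_i[kL+L-1]\,]^T$, $i = M-N, \ldots, M-1$ → speech + noise buffer $\mathbf{B}_1$

2. $\mathbf{Y}_i^n[k] = \mathrm{diag}\{[\,\mathbf{B}_2(i,0) \cdots \mathbf{B}_2(i,2L-1)\,]^T\}$, $i = M-N, \ldots, M-1$
cyclically shift each row $i$ of $\mathbf{B}_2$ over $2L$ samples, $i = M-N, \ldots, M-1$
$\mathbf{Y}_i[k] = \mathrm{diag}\{\mathbf{F}[\,y_i[kL-L] \cdots y_i[kL+L-1]\,]^T\}$, $i = M-N, \ldots, M-1$
$\mathbf{d}[k] = [\,0 \cdots 0 \;\; \mathbf{B}_{2,0}(1,0) \cdots \mathbf{B}_{2,0}(1,L-1)\,]^T$
cyclically shift $\mathbf{B}_{2,0}$ over $L$ samples

• Update formula:

1. $\mathbf{e}_1[k] = \mathbf{k}\mathbf{F}^{-1}\sum_{j=M-N}^{M-1}\mathbf{Y}_j^n[k]\mathbf{W}_j^*[k] = \mathbf{y}_{out,1}$
$\mathbf{e}[k] = \mathbf{d}[k] - \mathbf{e}_1[k]$
$\mathbf{e}_2[k] = \mathbf{k}\mathbf{F}^{-1}\sum_{j=M-N}^{M-1}\mathbf{Y}_j[k]\mathbf{W}_j^*[k] = \mathbf{y}_{out,2}$
$\mathbf{E}_1[k] = \mathbf{F}\mathbf{k}^T\mathbf{e}_1[k]$; $\mathbf{E}_2[k] = \mathbf{F}\mathbf{k}^T\mathbf{e}_2[k]$; $\mathbf{E}[k] = \mathbf{F}\mathbf{k}^T\mathbf{e}[k]$

2. $\mathbf{\Lambda}[k] = \frac{2\rho'}{L}\,\mathrm{diag}\{P_0^{-1}[k], \ldots, P_{2L-1}^{-1}[k]\}$
$P_m[k] = \gamma P_m[k-1] + (1-\gamma)\left(\sum_{j=M-N}^{M-1}|Y_{j,m}^n|^2 + \frac{1}{\mu}\left|\sum_{j=M-N}^{M-1}\left(|Y_{j,m}|^2 - |Y_{j,m}^n|^2\right)\right|\right)$

3. $\mathbf{W}_i[k+1] = \mathbf{W}_i[k] + \mathbf{F}\mathbf{g}\mathbf{F}^{-1}\mathbf{\Lambda}[k]\left\{\mathbf{Y}_i^n[k]\mathbf{E}^*[k] - \frac{1}{\mu}\left(\mathbf{Y}_i[k]\mathbf{E}_2^*[k] - \mathbf{Y}_i^n[k]\mathbf{E}_1^*[k]\right)\right\}$, $i = M-N, \ldots, M-1$

• Output: $\mathbf{y}_0[k] = [\,y_0[kL-\Delta] \cdots y_0[kL-\Delta+L-1]\,]^T$

- If noise detected: $\mathbf{y}_{out}[k] = \mathbf{y}_0[k] - \mathbf{y}_{out,1}[k]$

- If speech detected: $\mathbf{y}_{out}[k] = \mathbf{y}_0[k] - \mathbf{y}_{out,2}[k]$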
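
The overlap-save filtering step that underlies the error computation can be checked numerically with the sketch below. It uses the textbook overlap-save convention (zero-padded filter, product of spectra, last $L$ output samples kept) for a single real-valued channel; conjugation conventions of the algorithm listing are not reproduced, and the function name is hypothetical.

```python
import numpy as np

def overlap_save_output(w, y_block_2L):
    """Filter one 2L-sample block with an L-tap filter via the FFT (overlap-save)."""
    L = len(w)
    W = np.fft.fft(np.concatenate([w, np.zeros(L)]))   # zero-padded filter, 2L bins
    Y = np.fft.fft(y_block_2L)                         # spectrum of the 2L input samples
    circ = np.real(np.fft.ifft(Y * W))                 # circular convolution
    return circ[L:]                                    # keep the last L samples, as k = [0_L I_L] does
```

The last $L$ samples of the circular convolution are free of wrap-around, so they coincide with the linear convolution output for the new block.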



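Algorithm 2 Frequency-domain NLMS based SP-SDW-MWF based on overlap-save.

Initialization:

$\mathbf{W}_i[0] = [\,0 \cdots 0\,]^T$, $i = M-N, \ldots, M-1$

$P_m[0] = \delta_m$, $m = 0, \ldots, 2L-1$

Matrix definitions:

$\mathbf{g} = \begin{bmatrix} \mathbf{I}_L & \mathbf{0}_L \\ \mathbf{0}_L & \mathbf{0}_L \end{bmatrix}$; $\mathbf{k} = [\,\mathbf{0}_L \;\; \mathbf{I}_L\,]$; $\mathbf{F}$ = $2L \times 2L$ DFT matrix

For each new block of $NL$ input samples:

• If noise detected:

1. $\mathbf{F}[\,y_i[kL-L] \cdots y_i[kL+L-1]\,]^T$, $i = M-N, \ldots, M-1$ → noise buffer $\mathbf{B}_2$
$[\,y_0[kL-\Delta] \cdots y_0[kL-\Delta+L-1]\,]^T$ → noise buffer $\mathbf{B}_{2,0}$

2. $\mathbf{Y}_i[k] = \mathrm{diag}\{\mathbf{F}[\,y_i[kL-L] \cdots y_i[kL+L-1]\,]^T\}$, $i = M-N, \ldots, M-1$
$\mathbf{Y}_{i,buf_1}[k] = \mathrm{diag}\{[\,\mathbf{B}_1(i,0) \cdots \mathbf{B}_1(i,2L-1)\,]^T\}$, $i = M-N, \ldots, M-1$
$\mathbf{X}_i[k] = \sqrt{1-\frac{1}{\mu}}\,\mathbf{Y}_i[k] + \sqrt{\frac{1}{\mu}}\,\mathbf{Y}_{i,buf_1}[k]$, $i = M-N, \ldots, M-1$
cyclically shift each row $i$ of buffer $\mathbf{B}_1$ over $2L$ samples
$\mathbf{d}[k] = \frac{1}{\sqrt{1-\frac{1}{\mu}}}[\,0 \cdots 0 \;\; y_0[kL-\Delta] \cdots y_0[kL-\Delta+L-1]\,]^T$

• If speech detected:

1. $\mathbf{F}[\,y_i[kL-L] \cdots y_i[kL+L-1]\,]^T$, $i = M-N, \ldots, M-1$ → speech + noise buffer $\mathbf{B}_1$

2. $\mathbf{Y}_i[k] = \mathrm{diag}\{\mathbf{F}[\,y_i[kL-L] \cdots y_i[kL+L-1]\,]^T\}$, $i = M-N, \ldots, M-1$
$\mathbf{Y}_{i,buf_2}[k] = \mathrm{diag}\{[\,\mathbf{B}_2(i,0) \cdots \mathbf{B}_2(i,2L-1)\,]^T\}$, $i = M-N, \ldots, M-1$
$\mathbf{X}_i[k] = \sqrt{1-\frac{1}{\mu}}\,\mathbf{Y}_{i,buf_2}[k] + \sqrt{\frac{1}{\mu}}\,\mathbf{Y}_i[k]$, $i = M-N, \ldots, M-1$ (as in (78))
cyclically shift each row $i$ of buffer $\mathbf{B}_2$ over $2L$ samples
$\mathbf{d}[k] = \frac{1}{\sqrt{1-\frac{1}{\mu}}}[\,0 \cdots 0 \;\; \mathbf{B}_{2,0}(1,0) \cdots \mathbf{B}_{2,0}(1,L-1)\,]^T$
cyclically shift $\mathbf{B}_{2,0}$ over $L$ samples

• Update formula:

1. $\mathbf{E}[k] = \mathbf{F}\mathbf{k}^T\left(\mathbf{d} - \mathbf{k}\mathbf{F}^{-1}\sum_{j=M-N}^{M-1}\mathbf{X}_j[k]\mathbf{W}_j^*[k]\right)$

2. $\mathbf{\Lambda}[k] = \frac{2\rho'}{L}\,\mathrm{diag}\{P_0^{-1}[k], \ldots, P_{2L-1}^{-1}[k]\}$
$P_m[k] = \gamma P_m[k-1] + (1-\gamma)\sum_{j=M-N}^{M-1}|X_{j,m}|^2$

3. $\mathbf{W}_i[k+1] = \mathbf{W}_i[k] + \mathbf{F}\mathbf{g}\mathbf{F}^{-1}\mathbf{\Lambda}[k]\mathbf{X}_i[k]\mathbf{E}^*[k]$, $i = M-N, \ldots, M-1$

• Output: $\mathbf{y}_{out}[k] = \mathbf{y}_0[k] - \mathbf{k}\mathbf{F}^{-1}\sum_{j=M-N}^{M-1}\mathbf{Y}_j[k]\mathbf{W}_j^*[k]$, with
$\mathbf{y}_0[k] = [\,y_0[kL-\Delta] \cdots y_0[kL-\Delta+L-1]\,]^T$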



high frequencies. In that case, only a few frequency components of $\mathbf{Y}_i$ should be stored in the speech + noise
buffer.
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4.2 Improvement of stochastic gradient algorithm 

To achieve a reliable estimate (80) of the average correlation matrix $\mathcal{E}\{\mathbf{y}^s\mathbf{y}^{s,H}\}$ in highly time-varying (footnote 12)
noise scenarios (e.g., multi-talker babble), $K$ should be much larger than $LN$. Hence, the averaging in
the block-based (footnote 13) or frequency-domain implementation proposed in Section 4.1 does not suffice to obtain
a good estimate of $\mathcal{E}\{\mathbf{y}^s\mathbf{y}^{s,H}\}$. In this section, we show that the performance of the stochastic gradient
algorithm is improved by applying a low pass filter to the part of the gradient estimate that takes speech
distortion into account, i.e., the term $\mathbf{r}[k]$ in (65). The low pass filtering avoids a highly time-varying
distortion of the desired speech component while not degrading the tracking performance needed in
non-stationary noise scenarios.

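4.2.1 Concept

Define $\mathbf{w}_s$ as (footnote 14)

$$\mathbf{w}_s = \mathbf{w} - \mathbf{w}_n \qquad (83)$$
$$\mathbf{w}_s \in \mathrm{Range}\left\{\mathcal{E}\{\mathbf{y}^s\mathbf{y}^{s,H}\}\right\} \qquad (84)$$
$$\mathcal{E}\{\mathbf{y}^s\mathbf{y}^{s,H}\}\,\mathbf{w}_n = 0. \qquad (85)$$

Then, the desired speech component $z_s[k]$ at the output equals

$$z_s[k] = y_0^s[k-\Delta] - \mathbf{w}_s^H\mathbf{y}^s[k]. \qquad (86)$$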
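
For a Hermitian positive semi-definite speech correlation matrix, a split of this kind can be computed by orthogonal projection onto the range of the matrix, since its null space is orthogonal to its range. The sketch below assumes the matrix is available (in practice it is only estimated); the function name and the tolerance `tol` are hypothetical.

```python
import numpy as np

def split_filter(w, R_s, tol=1e-10):
    """Split w into w_s in Range(R_s) and w_n with R_s @ w_n = 0 (R_s Hermitian PSD)."""
    eigval, eigvec = np.linalg.eigh(R_s)
    U = eigvec[:, eigval > tol * eigval.max()]   # orthonormal basis of the range of R_s
    w_s = U @ (U.conj().T @ w)                   # orthogonal projection onto the range
    return w_s, w - w_s

# Rank-one example: only the component of w along a affects the speech term
a = np.array([1.0, 2.0, 3.0])
w_s, w_n = split_filter(np.array([1.0, 0.0, 0.0]), np.outer(a, a))
```

Only `w_s` influences the speech component at the output, which is why the remaining degrees of freedom in `w_n` can follow the noise without adding speech distortion.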

Assume that $\mathbf{w}_s$ varies slowly in time. This is desired since a fast changing $\mathbf{w}_s$ results in a highly
time-varying distortion of the desired speech, and may thus harm sound quality. In addition, in hearing aid
applications the average correlation matrix $\mathcal{E}\{\mathbf{y}^s\mathbf{y}^{s,H}\}$ is slowly time-varying, as microphone characteristics,
room acoustics and the average desired speaker position do not change quickly in time. Fast changes in the
noise scenario can be tracked by the filter $\mathbf{w}_n$. This will be illustrated in Section 4.3.
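Then,

$$\mathcal{E}\{\mathbf{y}^s\mathbf{y}^{s,H}\}\,\mathbf{w}[k] = \mathcal{E}\{\mathbf{y}^s\mathbf{y}^{s,H}\}\,\mathbf{w}_s[k] \qquad (87)$$

can be approximated by (footnote 15)

$$\mathcal{E}\{\mathbf{y}_{buf_1}\mathbf{y}_{buf_1}^H - \mathbf{y}\mathbf{y}^H\}\,\mathbf{w}_s = \mathcal{E}\{(\mathbf{y}_{buf_1}\mathbf{y}_{buf_1}^H - \mathbf{y}\mathbf{y}^H)\,\mathbf{w}_s\}, \qquad (88)$$

12 Like for the QR and GSVD based algorithms, we assume short-term stationarity of the second order statistics of the noise, so
that $\mathcal{E}\{\mathbf{y}^n\mathbf{y}^{n,H}\} \approx \mathcal{E}\{\mathbf{y}^n_{buf_1}\mathbf{y}^{n,H}_{buf_1}\}$. The first and higher order statistics are allowed to vary faster in time.
13 A large $K \gg LN$ in block-LMS would result in a too slow convergence rate.
14 $\mathcal{E}\{\mathbf{y}^s\mathbf{y}^{s,H}\}$ is rank deficient when the speech leakage $\mathbf{y}^s$ in the noise references does not cover the whole frequency spectrum
or when the number of inputs $N$ to the adaptive filter exceeds 1 and the direct-to-reverberant ratio of the desired speech is high.
15 Just like for the matrix based algorithms, the noise correlation matrix $\mathcal{E}\{\mathbf{y}^n\mathbf{y}^{n,H}\}$ is assumed to be short-term stationary so
that it can be estimated during periods of noise only.

where $\mathbf{y}$ is a vector during noise only. Using the independence assumption [40]

$$\mathcal{E}\{\mathbf{y}^n\mathbf{y}^{n,H}[k]\,\mathbf{w}_n[k]\} = \mathcal{E}\{\mathbf{y}^n\mathbf{y}^{n,H}[k]\}\,\mathcal{E}\{\mathbf{w}_n[k]\} \qquad (89)$$

and $\mathcal{E}\{\mathbf{y}^n\mathbf{y}^{n,H}\} = \mathcal{E}\{\mathbf{y}^n_{buf_1}\mathbf{y}^{n,H}_{buf_1}\}$, we find that

$$\mathcal{E}\{(\mathbf{y}_{buf_1}\mathbf{y}_{buf_1}^H - \mathbf{y}\mathbf{y}^H)\,\mathbf{w}_s\} = \mathcal{E}\{(\mathbf{y}_{buf_1}\mathbf{y}_{buf_1}^H - \mathbf{y}\mathbf{y}^H)\,\mathbf{w}[k]\}. \qquad (90)$$

Replacing the expectation value by time averaging, $\mathcal{E}\{\mathbf{y}^s\mathbf{y}^{s,H}\}\,\mathbf{w}_s[k]$ can be estimated as

$$\frac{1}{K}\sum_{l=k-K+1}^{k}\left(\mathbf{y}_{buf_1}[l]\mathbf{y}_{buf_1}^H[l] - \mathbf{y}[l]\mathbf{y}^H[l]\right)\mathbf{w}[l] \qquad (91)$$

during noise only (footnote 16). The value $K$ determines the convergence rate of the filter $\mathbf{w}_s$.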

Remark: In order to obtain a good estimate of $\mathcal{E}\{\mathbf{y}^s\mathbf{y}^{s,H}\}$, the long-term averaged noise correlation
matrices $\frac{1}{K}\sum_{l=k-K}^{k}\mathbf{y}^n\mathbf{y}^{n,H}[l]$ and $\frac{1}{K}\sum_{l=k-K}^{k}\mathbf{y}^n_{buf_1}\mathbf{y}^{n,H}_{buf_1}[l]$ should not differ too much from each other. This
does not require that the second order statistics of the noise source are stationary for about $K$ time samples.
It suffices that they are short-term stationary so that they can be estimated during noise-only periods.

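The averaging operation (91) is performed by applying the following low pass filter to the term
$\mathbf{r}[k] = \frac{1}{\mu}\left(\mathbf{y}_{buf_1}\mathbf{y}_{buf_1}^H - \mathbf{y}\mathbf{y}^H\right)\mathbf{w}[k]$ in (65):

$$\mathbf{r}[k] = \lambda\,\mathbf{r}[k-1] + (1-\lambda)\frac{1}{\mu}\left(\mathbf{y}_{buf_1}\mathbf{y}_{buf_1}^H - \mathbf{y}\mathbf{y}^H\right)\mathbf{w}[k], \qquad (92)$$

where $\lambda < 1$. This corresponds to an averaging window $K$ of about $\frac{1}{1-\lambda}$ samples. The normalized step size
$\rho$ is modified into

$$\rho = \frac{\rho'}{r_{avg}[k] + \mathbf{y}^H\mathbf{y} + \delta} \qquad (93)$$

$$r_{avg}[k] = \lambda\,r_{avg}[k-1] + (1-\lambda)\frac{1}{\mu}\left|\mathbf{y}_{buf_1}^H\mathbf{y}_{buf_1} - \mathbf{y}^H\mathbf{y}\right|. \qquad (94)$$

Compared to (65), (92) requires $3NL - 1$ additional MAC and extra storage of an $NL \times 1$ vector $\mathbf{r}$.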
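
A time-domain update with the low pass filtered regularization term can be sketched as follows for real-valued signals during noise-only periods. This is a sketch under stated assumptions, not the authors' code: the gradient is taken as $\mathbf{y}(y_0 - \mathbf{y}^T\mathbf{w}) - \mathbf{r}[k]$, and the class name, the default `rho_prime` and the constant `delta` are hypothetical.

```python
import numpy as np

class LowPassSG:
    """Stochastic gradient update with exponentially smoothed regularization term."""

    def __init__(self, n_taps, mu, lam, rho_prime=0.8, delta=1e-8):
        self.w = np.zeros(n_taps)
        self.r = np.zeros(n_taps)   # smoothed regularization term r[k]
        self.r_avg = 0.0            # smoothed normalization term r_avg[k]
        self.mu, self.lam = mu, lam
        self.rho_prime, self.delta = rho_prime, delta

    def update(self, y, y_buf1, y0_delayed):
        lam, mu = self.lam, self.mu
        # instantaneous (1/mu) (y_buf1 y_buf1^T - y y^T) w, then exponential smoothing
        inst = (y_buf1 * np.dot(y_buf1, self.w) - y * np.dot(y, self.w)) / mu
        self.r = lam * self.r + (1.0 - lam) * inst
        # smoothed (1/mu) |y_buf1^T y_buf1 - y^T y| for the step size normalization
        energy_diff = abs(np.dot(y_buf1, y_buf1) - np.dot(y, y))
        self.r_avg = lam * self.r_avg + (1.0 - lam) * energy_diff / mu
        rho = self.rho_prime / (self.r_avg + np.dot(y, y) + self.delta)
        err = y0_delayed - np.dot(y, self.w)
        self.w = self.w + rho * (y * err - self.r)
        return self.w
```

Setting `lam=0` removes the smoothing, and a very large `mu` makes the regularization vanish, so the update then reduces to a standard NLMS step.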



4.2.2 Frequency-domain 

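Equation (92) can be extended to the frequency domain. The update equation for $\mathbf{W}_i[k+1]$ in Algorithm 1
then becomes:

16 As also mentioned in Section 4.1, the noise-only vector $\mathbf{y}[k]$ should be replaced by $\mathbf{y}_{buf_2}[k]$ and the speech + noise vector
$\mathbf{y}_{buf_1}[k]$ by $\mathbf{y}[k]$ when adapting during periods of speech + noise.

$$\mathbf{W}_i[k+1] = \mathbf{W}_i[k] + \mathbf{F}\mathbf{g}\mathbf{F}^{-1}\mathbf{\Lambda}[k]\left(\mathbf{Y}_i^n[k]\mathbf{E}^*[k] - \mathbf{R}_i[k]\right);$$
$$\mathbf{R}_i[k] = \lambda\,\mathbf{R}_i[k-1] + (1-\lambda)\frac{1}{\mu}\left(\mathbf{Y}_i[k]\mathbf{E}_2^*[k] - \mathbf{Y}_i^n[k]\mathbf{E}_1^*[k]\right), \qquad (95)$$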



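with

$$\mathbf{E}[k] = \mathbf{F}\mathbf{k}^T\left(\mathbf{y}_0[k] - \mathbf{k}\mathbf{F}^{-1}\sum_{j=M-N}^{M-1}\mathbf{Y}_j^n[k]\mathbf{W}_j^*[k]\right); \qquad (96)$$

$$\mathbf{E}_1[k] = \mathbf{F}\mathbf{k}^T\mathbf{k}\mathbf{F}^{-1}\sum_{j=M-N}^{M-1}\mathbf{Y}_j^n[k]\mathbf{W}_j^*[k]; \qquad (97)$$

$$\mathbf{E}_2[k] = \mathbf{F}\mathbf{k}^T\mathbf{k}\mathbf{F}^{-1}\sum_{j=M-N}^{M-1}\mathbf{Y}_j[k]\mathbf{W}_j^*[k], \qquad (98)$$

and $\mathbf{\Lambda}[k]$ computed as follows:

$$\mathbf{\Lambda}[k] = \frac{2\rho'}{L}\,\mathrm{diag}\{P_0^{-1}[k], \ldots, P_{2L-1}^{-1}[k]\}$$
$$P_m[k] = P_{1,m}[k] + P_{2,m}[k]$$
$$P_{1,m}[k] = \gamma P_{1,m}[k-1] + (1-\gamma)\sum_{j=M-N}^{M-1}|Y_{j,m}^n|^2$$
$$P_{2,m}[k] = \lambda P_{2,m}[k-1] + (1-\lambda)\frac{1}{\mu}\left|\sum_{j=M-N}^{M-1}\left(|Y_{j,m}|^2 - |Y_{j,m}^n|^2\right)\right|$$

Compared to Algorithm 1, (95)-(98) requires one extra $2L$-point FFT and $8NL - 2N - 2L$ extra MAC per
$L$ samples and additional memory storage of a $2NL \times 1$ real data vector. To obtain the same time constant
in the averaging operation as in the time-domain version with $K = 1$, the frequency-domain $\lambda$ should equal $\lambda^L$, with $\lambda$ the time-domain value.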
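
The relation between the per-sample and per-block forgetting factors, and the approximate averaging window of an exponential low pass filter, can be written out as two small helpers. The function names are hypothetical and serve only to make the arithmetic explicit.

```python
def block_forgetting_factor(lam_time, L):
    """Per-block factor giving the same time constant as a per-sample factor lam_time."""
    return lam_time ** L

def averaging_window(lam):
    """Approximate window length (in updates) of the exponential average, 1 / (1 - lam)."""
    return 1.0 / (1.0 - lam)

# Example: a factor of 0.9998 averages over roughly 5000 updates
window = averaging_window(0.9998)
```

If such a factor is applied once per block of $L$ samples rather than once per sample, the window expressed in samples is $L$ times longer.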

Experimental results in Section 4.3 will show that the performance of the stochastic gradient algorithm 
significantly improves by the low pass filter, especially for large A. 

4.2.3 Complexity of different stochastic gradient algorithms 

Table 1 summarizes the computational complexity (expressed as the number of real multiply-accumulate (footnote 17)
(MAC) operations, divisions (D), square roots (Sq) and absolute values (Abs)) of the time-domain (TD) and
frequency-domain (FD) Stochastic Gradient (SG) and NLMS based algorithms. Comparison is made with standard
NLMS and the NLMS based SPA. We assume that one complex multiplication is equivalent to 4 real
multiplications and 2 real additions. A $2L$-point FFT of a real input vector requires $2L\log_2 2L$ real MAC
(assuming radix-2 FFT algorithms).

17 Counted as the number of multiply-accumulate operations, additions and multiplications.

Table 1 indicates that the TD-SG without filter w 0 and the SPA are about twice as complex as the 
standard ANC. When applying a Low Pass filter (LP) to the regularization term, the TD-SG algorithm 
has about three times the complexity of the ANC. The increase in complexity of the frequency-domain 
implementations is less. 



Table 1: Computational complexity of TD and FD NLMS and stochastic gradient algorithms (expressed as
number of real MAC, divisions (D), absolute values (Abs) and square roots (Sq) per sample). Terms marked
[illegible] could not be recovered from the source.

| Domain | Algorithm | Update formula | Adaptation of step size |
|--------|-----------|----------------|-------------------------|
| TD | NLMS ANC | ((2M - 2)L + 1) MAC | 1 D + (M - 1)L MAC |
| TD | NLMS based SPA | (4(M - 1)L + 1) MAC + 1 D + 1 Sq | 1 D + (M - 1)L MAC |
| TD | SG | (4NL + 5) MAC | 1 D + 1 Abs + (2NL + 2) MAC |
| TD | NLMS based algorithm | (4NL + 3) MAC | 1 D + NL MAC |
| TD | SG with LP | (7NL + 4) MAC | 1 D + 1 Abs + (2NL + 4) MAC |
| FD | NLMS ANC | (10M - 7 - [illegible]) + (6M - 2) log2 2L MAC | 1 D + (2M + 2) MAC |
| FD | NLMS based SPA | (14M - 11 - [illegible]) + (6M - 2) log2 2L MAC + 1/L Sq + 1/L D | 1 D + (2M + 2) MAC |
| FD | SG | (18N + 6 - [illegible]) + (6N + 8) log2 2L MAC | 1 D + 1 Abs + (4N + 4) MAC |
| FD | NLMS based algorithm | (16N + 4 - [illegible]) + (6N + 6) log2 2L MAC | 1 D + (2N + 2) MAC |
| FD | SG with LP | (26N + 4 - [illegible]) + (6N + 10) log2 2L MAC | 1 D + 1 Abs + (4N + 6) MAC |

Remark: In Table 1 and Figure 8, the complexity of the time-domain and frequency-domain NLMS ANC and
NLMS based SPA represents the complexity when the adaptive filter is only updated during noise only. If
the adaptive filter is also updated during speech + noise using data from a noise buffer, the time-domain
implementations require $NL$ additional MAC per sample and the frequency-domain implementations require
2 additional FFTs and $(4L(M-1) - 2(M-1) + L)$ MAC per $L$ samples.

As an illustration, Figure 8 plots the complexity (expressed as the number of Mega operations per second 
(Mops)) of the time-domain and frequency-domain stochastic gradient algorithm with LP filtering as a 
function of L for M = 3 and a sampling frequency f s = 16 kHz. Comparison is made with the NLMS- 
based ANC of the GSC and the SPA. The complexity of the FD SPA is not depicted, since for small M, 
it is comparable to the cost of the FD-NLMS ANC. For L > 8, the frequency-domain implementations 
result in a significantly lower complexity compared to their time-domain equivalents. The computational 
cost of the FD stochastic gradient algorithm with LP is limited, making it a good alternative to the SPA for 
implementation in hearing aids. 
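
The time-domain entries of Table 1 can be turned into a rough operation count as sketched below. The sketch adds the MAC counts of the update formula and of the step size adaptation and ignores the divisions and absolute values; the frequency-domain rows are left out. The function names and dictionary keys are hypothetical, and `N` defaults to `M - 1` (filter without $\mathbf{w}_0$) as an assumption.

```python
def td_mac_per_sample(algorithm, M, L, N=None):
    """MAC per sample (update + step size adaptation) for the time-domain rows of Table 1."""
    N = M - 1 if N is None else N
    table = {
        "nlms_anc": (2 * M - 2) * L + 1 + (M - 1) * L,
        "sg": (4 * N * L + 5) + (2 * N * L + 2),
        "nlms_based": (4 * N * L + 3) + N * L,
        "sg_lp": (7 * N * L + 4) + (2 * N * L + 4),
    }
    return table[algorithm]

def mops(mac_per_sample, fs=16000):
    """Convert MAC per sample into Mega operations per second at sampling rate fs."""
    return mac_per_sample * fs / 1e6

# Example for M = 3 microphones and L = 32 taps per channel
ratio_sg = td_mac_per_sample("sg", 3, 32) / td_mac_per_sample("nlms_anc", 3, 32)
ratio_lp = td_mac_per_sample("sg_lp", 3, 32) / td_mac_per_sample("nlms_anc", 3, 32)
```

For this setting the two ratios come out close to 2 and 3, consistent with the statement that the TD-SG is about twice, and the TD-SG with LP about three times, as complex as the standard ANC.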

4.3 Experimental results 

In this section, we evaluate the performance of the different FD stochastic gradient algorithms based on
experimental results for a hearing aid application. Comparison is made with the FD-NLMS based SPA. For
a fair comparison, the FD-NLMS based SPA is, like the stochastic gradient algorithms, also adapted during
speech + noise using data from a noise buffer.

4.3.1 Set-up 

A Behind-The-Ear (BTE) hearing aid with three omnidirectional microphones (Knowles
FG-3452) has been mounted on a dummy head in an office room. The interspacing $d$ between the first and
the second microphone is about 1 cm and the interspacing between the second and third microphone
about 1.5 cm. The reverberation time $T_{60\,\mathrm{dB}}$ is about 700 ms for a speech weighted noise. The desired speech
signal and the noise signals are uncorrelated. The desired speech source consists of sentences spoken by a
male speaker. Both the speech and the noise signal have a level of 70 dB SPL at the center of the head. The
desired speech source and noise sources are positioned at a distance of 1 meter from the head: the speech
source in front of the head, the noise sources at an angle $\theta$ w.r.t. the speech source. For evaluation purposes,
the speech and noise signal have been recorded separately.

The microphone signals are pre-whitened prior to processing to improve intelligibility [37], and the 
output is accordingly de-whitened. In the experiments, the microphones have been calibrated by means of 
recordings of an anechoic speech weighted noise signal positioned at 0° measured while the microphone 
array was mounted on the head. A delay-and-sum beamformer is used as a fixed beamformer, since -in case 
of small microphone interspacing - it is robust to model errors. The blocking matrix B pairwise subtracts 
the time aligned calibrated microphone signals. 
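
A spatial pre-processor of this kind can be sketched as follows. The sketch assumes that the microphone signals have already been time aligned and calibrated, takes the fixed beamformer output as their average, and forms the noise references as differences of adjacent microphones; which pairs are subtracted is an assumption, and the function name is hypothetical.

```python
import numpy as np

def spatial_preprocessor(mics):
    """mics: (M, T) array of time-aligned, calibrated microphone signals."""
    speech_ref = mics.mean(axis=0)        # delay-and-sum output after alignment
    noise_refs = mics[:-1] - mics[1:]     # pairwise differences, shape (M - 1, T)
    return speech_ref, noise_refs

# Perfectly matched microphones: the aligned speech cancels in the noise references
s = np.arange(5.0)
mics = np.tile(s, (3, 1))
ref, noise = spatial_preprocessor(mics)

# A 4 dB gain mismatch on the second microphone causes speech leakage
mics[1] *= 10 ** (4 / 20)
_, leak = spatial_preprocessor(mics)
```

With matched microphones the noise references contain no speech, whereas the gain mismatch leaves a scaled copy of the speech in both references, which is the leakage that the speech distortion term is meant to control.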

The performance of the FD stochastic gradient algorithms is evaluated for a filter length $L = 32$ taps per
channel, $\rho' = 0.8$ and $\gamma = 0$. To exclude the effect of the spatial pre-processor, the performance measures
are calculated w.r.t. the output of the fixed beamformer. The sensitivity of the algorithms against errors in
the assumed signal model is illustrated for microphone mismatch, e.g., a gain mismatch $T_2 = 4$ dB of the
second microphone. Among the different possible signal model errors, especially microphone mismatch
was found to be harmful to the performance of the GSC in a hearing aid application [17]. In hearing aids,
microphones are rarely matched in gain and phase. In [3], gain and phase differences between microphone
characteristics of up to 6 dB and 10°, respectively, have been reported.

4.3.2 Comparison of different FD stochastic gradient techniques 

Figure 9(a) and (b) compare the performance of the different FD Stochastic Gradient (SG) SP-SDW-MWF
algorithms without $\mathbf{w}_0$ (i.e., the SDR-GSC) as a function of the trade-off parameter $\mu$ for a stationary and a
non-stationary (e.g., multi-talker babble) noise source, respectively, at 90°. To analyze the impact of the
approximation (64) on the performance, the result of an FD implementation of (63), which uses the clean
speech, is depicted too. For both noise scenarios, the stochastic gradient algorithm significantly outperforms
the NLMS based algorithm, especially for $1/\mu \to 1$. Without Low Pass (LP) filter, both algorithms achieve
a worse improvement than (63), especially for large $1/\mu$. For a stationary speech-like noise source,
the FD-SG algorithm does not suffer too much from approximation (64). In a highly time-varying noise
scenario, such as multi-talker babble, the limited averaging of $\mathbf{r}[k]$ in the FD implementation does not
suffice to maintain the large noise reduction achieved by (63). The loss in noise reduction performance
could be reduced by decreasing the step size $\rho'$, at the expense of a reduced convergence speed. Applying
the low pass filter (95) significantly improves the performance for all $\mu$, while changes in the noise scenario
can still be tracked.

Figure 10 plots the improvement $\Delta\mathrm{SNR}_{intellig}$ and $\mathrm{SD}_{intellig}$ of the SP-SDW-MWF ($1/\mu = 0.5$) with and
without filter $\mathbf{w}_0$ for the babble noise scenario as the exponential weighting factor $\lambda$
of the LP filter (see (95)) is varied. Performance clearly improves for increasing $\lambda$. For small $\lambda$, the
SP-SDW-MWF with $\mathbf{w}_0$ suffers from a larger excess error, and hence a worse $\Delta\mathrm{SNR}_{intellig}$, compared to the
SP-SDW-MWF without $\mathbf{w}_0$. This is due to the larger dimensions of $\mathcal{E}\{\mathbf{y}^s\mathbf{y}^{s,H}\}$.

The LP filter prevents the desired speech from being distorted by a highly time-varying filter $\mathbf{w}_s$. In contrast
to a decrease in step size $\rho'$, the LP filter does not compromise tracking of changes in the noise scenario.
As an illustration, Figure 11 plots the convergence behavior of the FD stochastic gradient algorithm without
$\mathbf{w}_0$ (i.e., the SDR-GSC) for $\lambda = 0$ and $\lambda = 0.9998$, respectively, when the noise source position suddenly
changes from 90° to 180°. A gain mismatch $T_2$ of 4 dB was applied to the second microphone. To avoid fast
fluctuations in the residual noise energy $\varepsilon_n$ and speech distortion energy $\varepsilon_d$, the desired and interfering noise
sources in this experiment are stationary and speech-like. The upper figure depicts the residual noise energy $\varepsilon_n$
as a function of the number of input samples; the lower figure plots the residual speech distortion during
speech + noise periods as a function of the number of speech + noise samples. Both algorithms (i.e., $\lambda = 0$
and $\lambda = 0.9998$) have about the same convergence rate. When the change in position occurs, the algorithm
with $\lambda = 0.9998$ even converges faster. For $\lambda = 0$, the approximation error (64) remains large for a while
since the noise vectors in the buffer are not up to date. For $\lambda = 0.9998$, the impact of the instantaneous large
approximation error is reduced thanks to the low pass filter.

4.3.3 Comparison with SPA 

Figure 12 and Figure 13 compare the performance of the FD stochastic gradient algorithm with LP filter
($\lambda = 0.9998$) and the FD-NLMS based SPA in a multiple noise source scenario. The noise scenario consists
of 5 multi-talker babble noise sources positioned at angles 75°, 120°, 180°, 240°, 285° w.r.t. the desired
source at 0°. To assess the sensitivity of the algorithms against errors in the assumed signal model, the
influence of microphone mismatch, e.g., a gain mismatch $T_2 = 4$ dB of the second microphone, on the
performance is depicted too. In Figure 12, the improvement $\Delta\mathrm{SNR}_{intellig}$ and the distortion $\mathrm{SD}_{intellig}$ of the
SP-SDW-MWF with and without filter $\mathbf{w}_0$ are depicted as a function of the trade-off factor $1/\mu$. Figure 13
shows the results of the QIC-GSC

$$\mathbf{w}^H\mathbf{w} \le \beta^2 \qquad (99)$$

for different constraint values $\beta^2$, which is implemented using the FD-NLMS based SPA.

Both the SPA and the stochastic gradient based SP-SDW-MWF increase the robustness of the GSC
(i.e., the SP-SDW-MWF without $\mathbf{w}_0$ and $1/\mu = 0$). For a given maximum allowable distortion $\mathrm{SD}_{intellig}$,
the SP-SDW-MWF with and without $\mathbf{w}_0$ achieve a better noise reduction performance than the SPA. The
performance of the SP-SDW-MWF with $\mathbf{w}_0$ is, in contrast to the SP-SDW-MWF without $\mathbf{w}_0$, not affected
by microphone mismatch. In the absence of model errors, the SP-SDW-MWF with $\mathbf{w}_0$ achieves a slightly
worse performance than the SP-SDW-MWF without $\mathbf{w}_0$. With $\mathbf{w}_0$, the estimate of $\frac{1}{\mu}\mathcal{E}\{\mathbf{y}^s\mathbf{y}^{s,H}\}$ is less
accurate due to the larger dimensions of $\frac{1}{\mu}\mathcal{E}\{\mathbf{y}^s\mathbf{y}^{s,H}\}$ (see also Figure 10).

In short, the proposed stochastic gradient implementation of the SP-SDW-MWF preserves the benefit of 
the SP-SDW-MWF over the QIC-GSC. 

4.4 Conclusions 

In this paper, we derived time-domain and frequency-domain stochastic gradient algorithms for the SP- 
SDW-MWF and compared their performance to the SPA. Starting from the cost function of the SP-SDW- 
MWF, a time-domain stochastic gradient algorithm has been derived in Section 4.1. In addition, the LMS 
based algorithm [26] has been extended so that it applies to the SP-SDW-MWF. To speed up convergence
and reduce complexity, a frequency-domain implementation has been proposed. Both the stochastic
gradient and LMS based algorithms suffer from a large excess error when applied in highly time-varying noise
scenarios. In Section 4.2, we show that the excess error is reduced by applying a low pass filter to the part of
the gradient estimate that limits speech distortion. The low pass filtering avoids a highly time-varying distor- 
tion of the desired speech component while not degrading the tracking performance needed in time-varying 
noise scenarios. Section 4.3 compares the performance of the different frequency-domain stochastic gradi- 
ent algorithms for a hearing aid application. The stochastic gradient SP-SDW-MWF outperforms the LMS 
based algorithm, while complexity is not increased. For a non-stationary noise scenario, the LMS based and 
stochastic gradient SP-SDW-MWF suffer from a reasonably large excess error. Experimental results show 
that the low pass filtering significantly improves the performance of the stochastic gradient algorithm and 
does not compromise the tracking of changes in the noise scenario. In addition, experiments demonstrate 
that the proposed stochastic gradient algorithm preserves the benefit of the SP-SDW-MWF over QIC-GSC. 
The limited computational cost and the better noise reduction performance of the proposed algorithm make 
it a good alternative to the SPA for implementation in hearing aids. 
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Figure 1: Concept of the Generalized Sidelobe Canceller. 
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Figure 2: Equivalent approach of multi-channel Wiener filtering. 
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Figure 3: Spatially Pre-processed SDW MWF. 
Figure 4: Decomposition of SP-SDW-MWF with $\mathbf{w}_0$ in a multi-channel filter and single-channel
postfilter $\mathbf{e}_1 - \mathbf{w}_0$.
Figure 5: Influence of $1/\mu$ on the performance of the SDR-GSC for different gain mismatches $T_2$ at the
second microphone.



[Figure 6 plot: two panels versus $1/\mu$ (axis ticks from 0.5 to 2.5), with curves for $T_2$ = 2 dB, 4 dB and 6 dB.]

Figure 6: Influence of $1/\mu$ on the performance of the SP-SDW-MWF with $\mathbf{w}_0$ for different gain mismatches
$T_2$ at the second microphone.
Figure 7: $\Delta\mathrm{SNR}_{intellig}$ and $\mathrm{SD}_{intellig}$ for QIC-GSC as a function of $\beta^2$ for different gain mismatches $T_2$ at the
second microphone.
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Figure 8: Complexity (expressed in Mops) of TD and FD Stochastic Gradient (SG) algorithm with LP 
filtering as a function of filter length L per channel; M = 3. For comparison, the complexity of the standard 
NLMS ANC and SPA are depicted too. 
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Figure 9: Performance of different FD Stochastic Gradient (FD-SG) algorithms; (a) Stationary speech-like 
noise at 90° ; (b) Multi-talker babble noise at 90°. 
Figure 10: Influence of LP filter on performance of FD stochastic gradient SP-SDW-MWF ($1/\mu = 0.5$)
without $\mathbf{w}_0$ and with $\mathbf{w}_0$. Babble noise at 90°.
Figure 11: Convergence behavior of FD-SG for $\lambda = 0$ and $\lambda = 0.9998$. The noise source position suddenly
changes from 90° to 180° and vice versa.



[Figure 12 plot: two panels versus $1/\mu$ (axis ticks from 0.2 to 0.8), with curves for: no $\mathbf{w}_0$, $T_2$ = 0 dB; no $\mathbf{w}_0$, $T_2$ = 4 dB; $\mathbf{w}_0$, $T_2$ = 0 dB; $\mathbf{w}_0$, $T_2$ = 4 dB.]

Figure 12: Performance of FD stochastic gradient implementation of SP-SDW-MWF with LP ($\lambda = 0.9998$)
in a multiple noise source scenario.
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Figure 13: Performance of FD SPA in a multiple noise source scenario. 